Abstract

Experiments  have  shown  that  reasonable  amounts  of  fine  grain  parallelism  are  available  globally  in  serial 
programs.  New  compaction  algorithms,  as  well  as  a  new  architecture,  have  been  put  forward  to  exploit  such 
parallelism. 

Due  to  its  global  nature,  ordinary  local  compaction  techniques  are  inadequate  to  fully  extract  the  fine 
grain  parallelism  available  in  serial  programs.  Three  global  compaction  algorithms  are  presented: 

1.  Trace scheduling compacts the most probable execution paths of a program, using some local
compaction algorithm. Compensation code is inserted to preserve semantic equivalence with the original
program.

2.  Percolation  scheduling  is  based  on  a  series  of  elementary  semantic-preserving  program  transformations 
which  are  repeatedly  applied  using  higher  level  guidance  rules. 

3.  Region  scheduling  evenly  redistributes  the  parallelism  available  throughout  a  program,  taking  into 
account  machine  capabilities.  Parallelism  redistribution  is  again  achieved  via  repetitive  application  of 
elementary  program  transformations. 

Global compaction algorithms alone are insufficient to speed up loops effectively, since after compaction
loop iterations are still executed serially. A new compilation technique, software pipelining (the software
version  of  hardware  pipelining),  overcomes  this  problem  by  overlapping  successive  iterations  as  much  as 
possible.  Several  software  pipelining  algorithms  are  presented,  some  achieving  optimal  or  quasi-optimal 
overlapping  in  the  absence  of  resource  constraints. 

The  idea  behind  software  pipelining  algorithms  is  to  (explicitly  or  implicitly)  determine  the  (fixed  or 
variable)  time  interval  which  must  be  respected  before  starting  the  execution  of  a  new  iteration.  Most  of  the 
algorithms create a rigid pipeline, that is, iterations are scheduled when the compiler is certain that they will
execute  without  interruptions.  More  flexible  pipelining  algorithms  are  also  presented. 

The  above  compilation  techniques  have  been  developed  for  a  specific  type  of  architecture  termed  very  long 
instruction word (VLIW) architecture. VLIWs vaguely resemble horizontal microengines, but their structure
is  cleaner  and  their  instruction  set  more  orthogonal. 

Three sorts of VLIW machines, based on three different computational models, are presented. All three
computational paradigms allow multiple operations and multiple conditional branches per machine
instruction. What distinguishes the models is the way multiple conditional branches are handled.

The first model was specifically designed to accommodate trace scheduling: semantically, multiple
conditional jumps are organized as in a Lisp cond statement.

The second model, used by percolation scheduling, allows multiple conditional jumps to have the structure
of  an  arbitrary  decision  tree. 

The last model extends percolation scheduling by allowing conditional register assignments to be
performed within a decision tree. This model is used to generate flexible software pipelines.
1      Introduction

1.1  Disadvantages  of  Conventional  SIMD/MIMD  Architectures 

Multiprocessors  and  vector  machines  are  the  most  common  types  of  parallel  architectures  to  date.  Porting 
sequential  applications  to  such  architectures  involves  either  a  total  program  rewrite  (something  the  users  are 
reluctant  to  do)  or  a  recompilation  of  existing  applications  with  parallelizing  compilers. 

The  first  approach  provides  the  biggest  opportunity  for  speedup,  since  one  can  design  a  parallel  program 
suited  to  the  architecture  at  hand.  This  is  not  a  painless  activity,  as  it  often  involves  a  thorough  knowledge 
of machine idiosyncrasies (for instance, when compiling for the Cray-1 the programmer must be aware that
vector registers hold 64 elements). Users resort to this approach when the running time of a program is
critical. 

Acknowledging users' reluctance to recode working applications, parallelizing compilers try to
automatically extract coarse grain parallelism from existing (serial) applications. Compilers for multiprocessors
[Burke88] must minimize processor communication and synchronization, while devising a fair distribution of
the workload (load balancing). This requires the compiler to explore serial programs in search of relatively
independent sections of control and data. Compiling for vector machines [Allen87] requires looking for large
data aggregates (mainly arrays) which are transformed by simple operations like add and multiply.

Parallelizing  compilers  are  not  as  effective  as  programmers  in  their  program  transformation  task.  A 
parallelizing  compiler  has  to  rely  on  user  assertions  and/or  source  code  modifications  to  improve  the  quality 
of the code it generates. Furthermore, when a user is unable to restructure the source code in a way that the
compiler can exploit, part of the code runs serially and machine resources are wasted. This can result in
a poor overall speedup. Less user intervention and better parallel code quality can be obtained through
thorough program analysis, including interprocedural analysis, which is usually hard to carry out.

At any rate, parallelization for vector machines or multiprocessors, whether manual or automatic, requires
matching the coarse structure of a program to that of the hardware. This has proved to be quite a convoluted
task.

1.2  VLIW  Architectures 

Experiments  done  in  the  context  of  numerical  programs  [Riseman72,Nicolau84b],  have  shown  that  vast 
quantities  of  fine  grain  parallelism  are  available  in  serial  applications.  This  type  of  parallelism  seems  easy  to 
exploit in a parallel environment and researchers [Arvind80,Fisher84c,Carnevali86,Ebcioglu88] have suggested
its  use  over  coarse  grain  parallelism. 

There are two distinct schools of thought on the use of fine grain parallelism. One advocates its
exploitation at run time (dataflow languages and architectures), the other its automatic extraction at compile time.
Compile time extraction of fine grain parallelism is the subject of this survey paper. For a survey of the
dataflow approach see [Veen86].

What  characterizes  fine  grain  computations  is  the  availability,  at  any  given  time,  of  a  good  number  of 
simple but irregular operations acting upon independent data. This suggests an architecture with
independent functional units arranged in such a way that compile time scheduling of fine grain operations is possible
and entails practically zero run time overhead. The new architecture introduced to meet these demands is
inspired by horizontally microcoded engines. The archetype architecture (see figure 1) comprises a number
of integer/floating ALUs, one or several register files, data memory and a central control unit. All machine
resources are driven by the same clock, and are connected using some sort of topology. The instructions
which operate each individual unit are gathered in a very long instruction word whose size can go from 100
to more than 1,000 bits depending on the type and number of machine resources. The architecture has been
termed very long instruction word (VLIW) architecture. To a first approximation a very long instruction word
is similar to a horizontal microcode instruction, except that it is much longer and much more orthogonal.
As in horizontal microcode, VLIW instructions are the last layer in the software hierarchy: there is no
additional level of interpretation.

Figure 1: General VLIW architecture.

VLIW machines can be likened to synchronous MIMDs [Ebcioglu88]. In fact, let $M$ be an ideal shared
memory synchronous MIMD machine containing $p$ RISC-like processors, each executing an instruction every cycle.
Suppose the $p$ processors execute $p$ different programs comprising $n_1, \ldots, n_p$ instructions respectively. Then
$M$ can be simulated by a VLIW machine with complex conditional branching, that is a machine which can
branch to any of $2^p$ labels given a set of $p$ independent tests. The synchronous MIMD programs can be
translated into a VLIW program with at most $n_1 \cdot \ldots \cdot n_p$ instructions. Conversely, a VLIW machine with
complex conditional branching can be trivially simulated by a synchronous MIMD.
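
One way to read the two bounds above, offered here as an illustrative assumption rather than as the construction of [Ebcioglu88], is that each VLIW instruction stands for one tuple of per-processor program counters, and that $p$ independent two-way tests select one of $2^p$ successors. The short Python sketch below simply enumerates those tuples; the names `joint_states` and `max_branch_targets` and the program sizes are hypothetical.

```python
from itertools import product
from math import prod


def joint_states(lengths):
    """Enumerate every tuple (pc_1, ..., pc_p) of per-processor program counters."""
    return list(product(*(range(n) for n in lengths)))


def max_branch_targets(p):
    """p independent two-way tests select one of 2**p successor labels."""
    return 2 ** p


lengths = [3, 4, 2]  # hypothetical program sizes n_1, n_2, n_3
states = joint_states(lengths)
print(len(states), prod(lengths), max_branch_targets(len(lengths)))
```

The number of enumerated tuples equals the product of the program sizes, which is the upper bound on the size of the translated VLIW program quoted in the text.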

The  key  point  of  VLIW  architectures  is  the  symbiosis  between  hardware  and  software:  every  machine 
resource  is  completely  and  independently  controlled  by  the  compiler,  that  is: 


•  resources are tightly coupled: there is a single instruction stream initiating a set of fine grain operations
every cycle¹,

•  every operation executed by a machine resource takes a predictable amount of time; functional units
may be pipelined, but there are no hardware interlocks: the compiler has to take into account possible
data dependences and conflicts when scheduling program operations,

•  all  communications  between  machine  units  are  completely  orchestrated  by  the  compiler:  the  source 
and  destination,  as  well  as  the  time  needed  to  transfer  data  between  resources  are  all  known  to  the 
compiler  which,  based  on  this  information,  can  devise  an  efficient  schedule. 

The  input  to  a  VLIW  compiler  is  some  sort  of  sequential  intermediate  language,  like  triples  or  quadruples, 
that  has  undergone  extensive  conventional  optimization  [Aho86].  The  goal  of  a  VLIW  compiler  is  to  schedule 
every cycle as many fine grain operations as there are fields in a VLIW instruction, in such a way that program
semantics are preserved and execution time is minimized. The process of combining several intermediate
operations into a VLIW instruction is called compaction, and resembles the code compaction
process that takes place in horizontally microcoded machines. Code compaction can be one of two kinds:

local compaction: is restricted to basic blocks, that is, each VLIW instruction is built from operations of
the same basic block,

global compaction: disregards basic block boundaries, so VLIW instructions may comprise operations from
different basic blocks.

Local  compaction  is  generally  inadequate  for  VLIW  machines,  as  there  isn't  enough  fine  grain  parallelism  to 
exploit  within  a  basic  block  [Riseman72]. 

1.3      Historical  Evolution  of  VLIW  Architectures 

VLIW  architectures  evolved  from  research  in  horizontally  microcoded  machines.  Horizontal  microengines 
have a small number of non-uniform hardware resources connected in an irregular fashion. The advantage
of  horizontal  over  vertical  microarchitectures  is  the  ability  to  execute  several  microoperations  concurrently. 
This  performance  increase  comes  at  the  cost  of  more  complex  programming,  as  several  (possibly  conflicting) 
parallel  activities  have  to  be  specified  for  each  clock  cycle.  To  ease  the  programming  effort  of  horizontal 
microengines,  microcode  compilers  have  been  introduced  to  avoid  hand-coding  of  such  machines:  a  microcode 
compiler  takes  a  vertical  list  of  microoperations  and  compacts  them  into  horizontal  microinstructions. 

Until the late 70s microcode compaction was essentially local (see [Landskov80] for a survey of local
microcode compaction). The local compaction problem (section 3) is NP-complete, but algorithms such as list
scheduling (section 3.2), inspired by job scheduling theory [Coffman76], very often produce results which
are within a few percent of the time optimum. As horizontal microengines became more and more
complex, allowing more concurrent operations to be specified every cycle, the need grew to employ more global
techniques. In 1978 Tokoro et al. [Tokoro78] introduced one such technique, where local compaction is first
applied to each basic block, and then motion of operations between adjacent basic blocks is considered, as
allowed by a list of legal moves. The method remains inherently local as the local compaction phase
imposes too many arbitrary constraints which limit the applicability of the global code motion rules. In 1979
Fisher [Fisher79] introduced an interesting global compaction technique, trace scheduling (from here on TS),
which selects the most probable execution paths of a microprogram and locally compacts such paths,
disregarding conditional jumps (compensation code is inserted to preserve semantic equivalence with the original
program). It is the adaptation of this technique to high level programs that gave birth to the field of VLIWs.

¹ From here on the word instruction refers to a VLIW instruction, whereas the word operation refers to one of the fine grain
operations composing a VLIW instruction.

Experiments done in the early 70s and repeated in the early 80s (section 4.1) suggest that great amounts
of fine grain parallelism are available in ordinary programs. In the 70s such parallelism was dismissed
because of its global nature and the inability to construct efficient hardware mechanisms to exploit it.
Fisher realized that TS could be adapted (section 4.2) to extract a good part of the parallelism globally
available in serial programs. To exploit such parallelism Fisher and his team designed the ELI, the first
true VLIW architecture. The machine model used in the ELI accommodates the needs of TS, by allowing
multiple operations and conditional jumps per VLIW instruction. The conditional jumps are organized
as in a Lisp cond statement. The ELI project gave birth to a company, Multiflow Inc., whose goal was to
produce a commercial product based on trace scheduling and VLIW technology. A line of machines called
the TRACE(TM) was eventually introduced in 1988.

Despite its good results TS has some problems, the biggest of which is the (potentially exponential)
increase in code size caused by the insertion of compensation code. This is a very acute problem in microcoded
machines, where microprogramming memory is small and space is at a premium. Methods have
been  introduced  in  the  microprogramming  world  (section  4.3)  to  overcome  code  explosion  and  some  other 
problems  of  TS. 

As  a  result  of  his  experience  with  TS,  Nicolau  [Nicolau84a]  has  introduced  another  global  compaction 
technique  that  formalizes  the  ideas  behind  TS  (section  4.4).  The  new  global  compaction  technique  is  based  on 
a  series  of  semantic  preserving  elementary  program  transformations,  that  are  repeatedly  applied  using  some 
high  level  guidance  rules.  The  repeated  application  of  such  elementary  transformations  allows  operations  to 
percolate  to  the  beginning  of  a  program.  The  machine  model  that  Nicolau  uses  for  his  global  compaction 
technique  (called  percolation  scheduling)  extends  the  model  used  for  TS,  by  allowing  arbitrary  decision  trees 
in  each  VLIW  instruction. 

The  last  global  compaction  technique  presented  in  this  paper  is  region  scheduling  (section  4.5)  developed 
by  Gupta  in  1987.  Region  scheduling  works  by  evenly  redistributing  the  parallelism  available  throughout  a 
program, taking into account machine capabilities. As with percolation scheduling, parallelism is redistributed
through repetitive application of semantic preserving elementary program transformations. The elementary
transformations are however not defined in terms of a program's flow graph, but rather a new intermediate
program  representation  paradigm:  the  program  dependence  graph  [Ferrante87].  The  program  dependence 
graph  is  better  suited  than  the  program  flow  graph  for  the  extraction  of  fine  grain  parallelism,  as  no  spurious 
ordering  constraints  are  imposed  on  program  operations. 

The previous global compaction techniques are unable to speed up loops significantly since after
compaction loop iterations are still executed serially. Once more a technique used in the microprogramming
world [Kogge77] has been adapted to VLIWs. This technique, termed software pipelining, overlaps loop
iterations as much as resource and data dependences permit. Simple software pipelining algorithms were
presented as early as 1981 for the FPS-164 (section 5.1), a precursor of VLIW machines. Better software
pipelining algorithms for the FPS were given in 1984 and 1986 respectively by Touzeau and Su (section 5.1).
However, it is only recently (1987/1988) that more sophisticated loop pipelining algorithms have been
introduced. Lam (section 5.1.2) has presented an algorithm that can process arbitrary loops, and produce
near time optimal results for loops with no IF statements. Aiken and Nicolau (section 5.1.4) have discovered
a polynomial time algorithm, greedy scheduling, that produces time optimal results for synchronous MIMD
machines, for loops without IF statements.

The previous algorithms all assume machines allowing only one conditional jump per VLIW instruction.
In 1987 Ebcioglu introduced a machine model extending the percolation scheduling model by allowing
multiple conditional register assignments within the decision tree of a percolation scheduling instruction.
Ebcioglu employs this machine model in conjunction with a pipelining algorithm called pipeline scheduling
to pipeline inherently sequential loops (section 5.2.1). In 1988 Aiken and Nicolau presented another
pipelining technique (based on the percolation scheduling model) to generate code for VLIW machines with
multiple conditional branches. The technique, called perfect pipelining, has the same effect as unbounded
loop unrolling followed by compaction.


2      Program  Dependence  Graph 

Serial or parallel optimization techniques need to preserve the "characteristics" of the initial program. More
precisely, given a program $\mathcal{P}$, any semantic-preserving program transformation has to respect the partial
ordering of operations imposed by $\mathcal{P}$'s data and control dependences.

Flow  graphs  [Aho86]  are  the  most  common  way  of  representing  control  dependences.  To  express  data 
dependences  flow  graphs  are  usually  supplemented  with  definition-use  and/or  use-definition  chains  [Aho86]. 

For parallel optimizations data dependence graphs [Kuck81] (section 2.1) formalize the notion of data
dependence and are a more appropriate representation than definition-use and use-definition chains. Likewise
control dependence graphs [Ferrante87] (section 2.2) are more appropriate than flow graphs to represent
control dependences as they more faithfully reflect a program's control structure (flow graphs imply an
excessive number of control dependences). By combining the control and data dependence graphs one obtains
an intermediate program representation, called the program dependence graph [Ferrante87], that adequately
reflects the essential characteristics of a program.

While  data  dependence  graphs  have  been  used  for  more  than  a  decade  and  have  now  become  a  standard 
tool  in  parallel  optimization  techniques,  control  dependence  graphs  have  been  introduced  only  recently. 
Consequently  all  of  the  compaction  techniques  herein  presented,  except  for  region  scheduling  (section  4.5), 
are  formulated  in  terms  of  flow  and  data  dependence  graphs. 

2.1      The  Data  Dependence  Graph 

Let $op_1$ and $op_2$ be two operations of a program $\mathcal{P}$. Operation $op_2$ is said to be data dependent on operation
$op_1$ if there is a path from $op_1$ to $op_2$ in $\mathcal{P}$'s flow graph, and one of the two operations modifies a variable
that the other operation uses or also modifies. The three possibilities are termed as follows:

-  flow dependence: $op_1$ modifies a variable used in $op_2$ (write-read dependence);

-  anti-dependence: $op_1$ uses a variable that $op_2$ modifies (read-write dependence);

-  output dependence: both $op_1$ and $op_2$ modify the same variable (write-write dependence).

Of the above, the only "true" dependence is flow dependence, as it is part of an algorithm's data flow and cannot
be removed by simple transformations. Anti and output dependences arise because of the programmer's concern
to reuse storage. Often these false dependences can be removed by copying or renaming [Kuck81,Padua86].
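
The three cases can be checked mechanically once each operation is summarized by the variables it reads and writes. The Python sketch below is a minimal illustration under that assumption; the `Op` record, the `classify` function and the three sample operations are hypothetical names introduced here, and the function assumes that a flow graph path from the first operation to the second already exists.

```python
from typing import FrozenSet, List, NamedTuple


class Op(NamedTuple):
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def classify(op1: Op, op2: Op) -> List[str]:
    """Kinds of data dependence of op2 on op1 (op1 is assumed to precede op2)."""
    kinds = []
    if op1.writes & op2.reads:
        kinds.append("flow")    # write-read
    if op1.reads & op2.writes:
        kinds.append("anti")    # read-write
    if op1.writes & op2.writes:
        kinds.append("output")  # write-write
    return kinds


a = Op("a", frozenset({"y"}), frozenset({"x"}))  # x := y + 1
b = Op("b", frozenset({"x"}), frozenset({"y"}))  # y := x * 2
c = Op("c", frozenset({"z"}), frozenset({"x"}))  # x := z

print(classify(a, b), classify(a, c), classify(b, c))
```

Renaming the `x` written by `c` to a fresh variable would empty both the anti and the output intersections involving `c`, which is the effect of the renaming transformation mentioned above.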

The data dependence graph (DDG) represents $\mathcal{P}$'s data dependences. The DDG is a directed multigraph,
whose nodes are $\mathcal{P}$'s operations (in what follows we identify an operation with the node representing it).
There is an edge from $op_1$ to $op_2$ if operation $op_2$ may² depend on $op_1$.

If $\mathcal{P}$'s flow graph is acyclic the DDG is a dag (data dependence dag or DDD). When dealing with loops
some further clarifications are needed. Operations contained in loops are executed several times and thus
dependences can arise between different instances of the same operation, or between different instances of
different operations. This type of dependence is termed loop carried (or inter-loop) dependence [Allen87].
Normal data dependences occurring between operations of the same iteration are called loop independent
(or intra-loop) dependences. In the context of VLIWs we are mainly interested in loop carried dependences
of innermost loops. Let $L$ be some innermost loop in $\mathcal{P}$. We assume that $L$ has been normalized to start
iterating at 1, and has a loop increment of 1. Let $op_1$ and $op_2$ be two operations occurring in $L$. If the
execution of operation $op_2$ during iteration $j$ may depend on the execution of operation $op_1$ during iteration $i$,
$i < j$, we create an edge from $op_1$ to $op_2$ labeled by the value $j - i$. We call $j - i$ the lag of the loop carried
dependence. For an edge $e$ we define $A(e)$ to be the minimum lag associated with $e$ when $e$ represents a loop
carried dependence, otherwise $A(e) = 0$. Loop carried dependences may introduce cycles in the DDG. This
happens, for instance, when values computed during an iteration are used as inputs in subsequent iterations
(this type of loop is called a recurrence).

² Even if we are not certain that $op_2$ will depend on $op_1$, we must insert a dependence edge to ensure correct execution. A
"may" dependence edge is needed, for instance, between two operations that could potentially modify the same variable.

(a)
    FOR i IN 2 .. n LOOP
        x(i) := a(i) * x(i-1) + b(i) * x(i-2);
    END LOOP;

(b)
    t1 := a(i)
    t2 := i-1
    t3 := x(t2)
    t4 := t1*t3
    t5 := b(i)
    t6 := i-2
    t7 := x(t6)
    t8 := t5*t7
    t9 := t4+t8
    x(i) := t9

(c)  [DDG drawing not recoverable]

Figure 2: (a) A loop. (b) The intermediate code for the loop body. (c) The loop DDG. Loop carried
dependence edges are dashed (loop carried dependences introduced by temporaries are ignored).

As we have previously said, each edge $e$ has an attached value $A(e)$. When dealing with concrete machines
we also associate with each edge $e = (op_1, op_2)$ the minimum delay that must be respected between the execution
of $op_1$ and $op_2$. We call this value $\delta(e)$. Thus every edge $e = (op_1, op_2)$ has an associated tuple $[A(e), \delta(e)]$,
whose meaning is:

-  $A(e) = 0$: operation $op_2$ must execute $\delta(e)$ cycles after the execution of $op_1$ has begun;

-  $A(e) \neq 0$: operation $op_2$ must execute $\delta(e)$ cycles after the execution of $op_1$ of
the $A(e)$th previous iteration has begun.

Note that $\delta(op_1, op_2)$ might be less than the total execution time of $op_1$, as $op_1$ might produce a result before
its execution completes.
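
The meaning of the tuple can be phrased as a check on start times. The Python sketch below is a minimal illustration that reads "must execute $\delta(e)$ cycles after" as "at least $\delta(e)$ cycles after", in line with $\delta$ being a minimum delay. The names `edge_respected` and `periodic`, the operation names and the cycle numbers are hypothetical, and starting one iteration every `period` cycles is an assumption of the example only.

```python
from typing import Dict, Tuple

Start = Dict[Tuple[str, int], int]  # (operation, iteration) -> start cycle


def edge_respected(start: Start, op1: str, op2: str,
                   lag: int, delay: int, n_iter: int) -> bool:
    """Check the edge (op1, op2) labeled [lag, delay] over iterations 1..n_iter."""
    for k in range(1, n_iter + 1):
        src_iter = k - lag
        if src_iter < 1:
            continue  # the dependence reaches back before the first iteration
        if start[(op2, k)] < start[(op1, src_iter)] + delay:
            return False
    return True


def periodic(offsets: Dict[str, int], period: int, n_iter: int) -> Start:
    """Start times when a new iteration begins every `period` cycles."""
    return {(op, k): off + (k - 1) * period
            for op, off in offsets.items()
            for k in range(1, n_iter + 1)}


offsets = {"load": 0, "store": 3}  # hypothetical cycle offsets inside one iteration
print(edge_respected(periodic(offsets, 4, 5), "store", "load", 1, 1, 5))
print(edge_respected(periodic(offsets, 3, 5), "store", "load", 1, 1, 5))
```

With a lag of 1 and a delay of 1, starting iterations every 4 cycles satisfies the edge, whereas every 3 cycles places the load of one iteration in the same cycle as the store of the previous one and violates it.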

As an example consider the loop in figure 2(a). The intermediate code for the loop body is given in
figure 2(b), the loop data dependence graph is given in figure 2(c).
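
Since the drawing of figure 2(c) is not available here, the Python sketch below rebuilds a plausible edge list from the intermediate code of figure 2(b). It is an illustration under stated assumptions, not the figure itself: only flow dependences through temporaries and dependences through the array x are considered, loop carried dependences on temporaries are ignored as in the caption, and `st` is a hypothetical name for the final store `x(i) := t9`.

```python
# Each entry: (name, temporaries used, access to x as (kind, index offset) or None)
OPS = [
    ("t1", set(), None),             # t1 := a(i)
    ("t2", set(), None),             # t2 := i-1
    ("t3", {"t2"}, ("load", -1)),    # t3 := x(t2)
    ("t4", {"t1", "t3"}, None),      # t4 := t1*t3
    ("t5", set(), None),             # t5 := b(i)
    ("t6", set(), None),             # t6 := i-2
    ("t7", {"t6"}, ("load", -2)),    # t7 := x(t6)
    ("t8", {"t5", "t7"}, None),      # t8 := t5*t7
    ("t9", {"t4", "t8"}, None),      # t9 := t4+t8
    ("st", {"t9"}, ("store", 0)),    # x(i) := t9
]


def build_ddg(ops):
    """Return a set of edges (source, target, lag)."""
    edges = set()
    # Loop independent flow dependences through temporaries.
    for j, (name_j, uses_j, _acc) in enumerate(ops):
        for name_i, _uses, _acc_i in ops[:j]:
            if name_i in uses_j:
                edges.add((name_i, name_j, 0))
    # Dependences through x: the store of iteration i writes x(i + s_off) and the
    # load of iteration j reads x(j + l_off); both touch one element when
    # j - i = s_off - l_off.
    accesses = [(pos, name, acc) for pos, (name, _u, acc) in enumerate(ops) if acc]
    for ps, s_name, (s_kind, s_off) in accesses:
        if s_kind != "store":
            continue
        for pl, l_name, (l_kind, l_off) in accesses:
            if l_kind != "load":
                continue
            lag = s_off - l_off
            if lag > 0:
                edges.add((s_name, l_name, lag))    # loop carried flow dependence
            elif lag < 0:
                edges.add((l_name, s_name, -lag))   # loop carried anti-dependence
            elif pl < ps:
                edges.add((l_name, s_name, 0))      # anti-dependence, same iteration
            else:
                edges.add((s_name, l_name, 0))      # flow dependence, same iteration
    return edges


for edge in sorted(build_ddg(OPS)):
    print(edge)
```

The two edges leaving `st` carry lags 1 and 2, and together with the chain from `t3` (or `t7`) back to `st` they close the cycles that make this loop a recurrence.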


2.2      The  Control  Dependence  Graph  and  the  Program  Dependence  Graph 

Informally an operation $op$ is said to be control dependent on some test $t$ if $op$ can only be executed as a
result of following one of $t$'s branches. Formally, let $\mathcal{P}$ be some program and let $\mathcal{F}$ be its flow graph ($\mathcal{F}$ is
assumed to have a unique exit node called exit, and vertices to contain a single operation). An operation
$op_1$ of $\mathcal{F}$ is said to be post-dominated by an operation $op_2$ if every path from $op_1$ to exit contains $op_2$. An
operation $op$ of $\mathcal{F}$ is control dependent on a test $t$ of $\mathcal{F}$ if and only if:

there exists a path $P$ from $t$ to $op$, such that any operation in $P$ (except for $t$) is post-dominated
by $op$, and $t$ is not post-dominated by $op$.

The control dependence graph (CDG) of a program expresses control dependences in the same manner the
DDG expresses data dependences. The CDG vertex set is composed of $\mathcal{F}$'s operations. Let $op_1$ and $op_2$ be two
such operations. Then there is an edge from $op_1$ to $op_2$ in the CDG if and only if $op_2$ is control dependent
on $op_1$ ($op_1$ is a test). Edges are labeled true or false, indicating the condition under which an operation is
executed. See figure 3(a) and (b) for an example.
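
The definition can be turned into a small computation. The Python sketch below is a minimal illustration, not the algorithm of [Ferrante87]: it computes post-dominator sets by a simple fixed-point iteration and then uses the observation that the required path exists exactly when $op$ post-dominates some successor of $t$. The flow graph, its node names and the function names are hypothetical, and every node is assumed to reach exit.

```python
def post_dominators(succ, exit_node):
    """succ maps each node to a list of (successor, branch label) pairs."""
    nodes = set(succ) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set(nodes)
            for s, _label in succ[n]:
                new &= pdom[s]      # nodes found on every path from each successor
            new.add(n)              # a node post-dominates itself
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(succ, exit_node):
    """Return CDG edges as triples (test, operation, branch label)."""
    pdom = post_dominators(succ, exit_node)
    cdg = set()
    for t, out in succ.items():
        for s, label in out:
            for op in pdom[s]:
                if op not in pdom[t]:
                    cdg.add((t, op, label))
    return cdg


# Hypothetical flow graph: if t1 then (if t2 then a else b); c
FLOW = {
    "t1": [("t2", True), ("c", False)],
    "t2": [("a", True), ("b", False)],
    "a": [("c", None)],
    "b": [("c", None)],
    "c": [("exit", None)],
}

for edge in sorted(control_dependences(FLOW, "exit")):
    print(edge)
```

In this example `a` and `b` depend only on the inner test `t2`, and `c` depends on no test because it post-dominates `t1`.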


Figure 3: (a) Initial flow graph. Dotted edges indicate branches taken on false conditionals. (b) The
corresponding CDG. Edges labeled true are plain, edges labeled false are dotted. (c) The final CDG after
region nodes have been added. The numbers next to a region node indicate the predicate which must hold
for the region to execute.


To summarize the set of control conditions for an operation, and group all the operations with the same
set of control dependences together, region nodes are introduced. In what follows, two operations are said
to have the same control dependence predecessors if each has control dependence edges from exactly the same
nodes and the corresponding edges are labeled with the same true or false labels. Region nodes are created
as follows:

1.  Consider an operation $op$ which has no region node predecessor in the CDG. Let $Pred$ be the set of
control dependence predecessors of $op$.

2.  A region node $R$ is created for $Pred$. Each operation whose set of control dependence predecessors is
$Pred$ is made to have the single control dependence predecessor $R$.

3.  $R$'s set of control dependence predecessors is set to be $Pred$.

4.  If there exists a region $R'$ whose set of control dependence predecessors contains $Pred$, $R'$ is made to
be directly control dependent on $R$ and the edges corresponding to $Pred$ in $R'$ are deleted.

5.  The previous steps are repeated until step 1 no longer applies.

Essentially, a region node represents a conditional expression which, when true, causes the operations
dependent on it to execute (in parallel if there are no data dependences). See figure 3(c).
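
The steps above can be sketched as a grouping pass followed by a factoring pass. The Python code below is a simplified reading under stated assumptions, not the procedure of [Ferrante87]: regions are created for all distinct predecessor sets at once instead of one operation at a time, step 4 is applied greedily from the largest contained set downwards, and the function name, region names and sample predecessor sets are hypothetical.

```python
def build_regions(preds):
    """preds maps an operation to a frozenset of (test, label) predecessors."""
    # Steps 1 to 3: one region node per distinct set of predecessors.
    region_of = {}
    for op in sorted(preds):
        region_of.setdefault(preds[op], f"R{len(region_of) + 1}")
    members = {r: sorted(op for op in preds if preds[op] == p)
               for p, r in region_of.items()}

    # Step 4: when the predecessors of another region are contained in those of
    # this region, depend on that region and drop the shared edges.
    by_size = sorted(region_of.items(), key=lambda item: -len(item[0]))
    region_preds = {}
    for p, r in region_of.items():
        remaining, parents = set(p), set()
        for q, other in by_size:
            if q and q < p and q <= remaining:
                remaining -= q
                parents.add(other)
        region_preds[r] = (frozenset(remaining), frozenset(parents))
    return members, region_preds


# Hypothetical control dependence predecessor sets.
PREDS = {
    "a": frozenset({("t1", True)}),
    "b": frozenset({("t1", True)}),
    "c": frozenset({("t1", True), ("t2", False)}),
    "d": frozenset({("t1", False)}),
}

members, region_preds = build_regions(PREDS)
print(members)
print(region_preds)
```

Here `a` and `b` share a region, and the region holding `c` keeps only its edge from `t2` and becomes a child of the region for `t1` true.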

The program dependence graph (PDG) combines the control and data dependence graphs of a program.
The PDG vertex set comprises the program's operations and the CDG region nodes; the PDG edge set
is the union of the DDG and CDG edge sets. While the PDG explicitly represents the data and control
dependences that exist in a program $\mathcal{P}$, $\mathcal{P}$'s flow graph does not have this desirable property, as it imposes
a fixed sequencing of operations that need not hold for an algorithmically and semantically equivalent program
$\mathcal{P}'$.


3      Local  Compaction 

In the following sections the local compaction problem (LCP) is formally defined and list scheduling, a popular
heuristic  for  performing  local  compaction,  is  presented.  Despite  the  NP-completeness  of  LCP,  list  scheduling 
produces  solutions  which  are  generally  within  a  few  percent  of  the  time  optimum. 

3.1      The  Local  Compaction  Problem 

Informally  the  local  compaction  problem  (LCP)  is:  given  a  basic  block  (that  is  a  set  of  operations  along  with 
their  data  dependences),  and  given  machine  resource  constraints,  the  goal  is  to  schedule  the  operations  in  a 
time-optimal  fashion  such  that  resource  and  data  dependence  constraints  are  respected.  Let  N  be  the  set  of 
naturals,  and  Z  be  the  set  of  integers.  Formally  the  local  compaction  jiroblem  is:  given 

1. a machine $M$ and a set of resources $\mathcal{R} = \{r_1, \ldots, r_m\}$ that the machine possesses,

2. a resource configuration vector $R_M$ of $N^m$, where the $k$th entry of $R_M$ (denoted $R_M(k)$) gives the
number of units of resource $r_k$ available in the machine configuration,

3. a set of $l$ operations $O = \{op_1, \ldots, op_j, \ldots, op_l\}$,

4. a duration function $d : O \to N$, where $d(op_j)$ is the number of machine cycles $op_j$ takes to execute,

5. a resource usage function $R_O : O \times Z \to N^m$ ($\vec{0}$ is the null vector):

$$R_O(op_j, x) = \begin{cases} \text{the resource usage vector of the } x\text{th step of operation } op_j & \text{if } 0 \le x < d(op_j) \\ \vec{0} & \text{otherwise} \end{cases}$$

the $k$th entry of $R_O(op_j, x)$ (denoted $R_O(op_j, x)(k)$) gives the number of units of resource $r_k$ needed
in the $x$th time step of operation $op_j$,

6. a data dependence dag $DDD = (O, E)$ imposing a partial ordering on $O$ (DDD embodies the data
dependences of $O$'s operations),

7. a delay function $\delta : E \to N$, defined on the edges of DDD, where for $e = (op_{j_1}, op_{j_2})$, $\delta(e)$ is the delay
that has to be respected before scheduling $op_{j_2}$, once $op_{j_1}$ has been scheduled (this is the delay function
of section 2.1).

The goal is to find a schedule $\sigma : O \to N$ such that:

1. minimality: $\sigma$ is of minimum length. The length of a schedule $\sigma$ is $len(\sigma) = \max_{1 \le j \le l} (\sigma(op_j) + d(op_j))$,

2. dependence constraints: $\forall e = (op_{j_1}, op_{j_2}) \in E$, $\sigma(op_{j_2}) - \sigma(op_{j_1}) \ge \delta(e)$,

3. resource constraints: define vector addition in the usual way, and vector comparison $\preceq$ to be:
$v_1 \preceq v_2$ iff $\forall\, 0 < k \le m$, $v_1(k) \le v_2(k)$; then the resource constraints are:

$$\forall\, 0 \le t < len(\sigma): \quad \sum_{j=1}^{l} R_O(op_j, t - \sigma(op_j)) \preceq R_M$$

The  local  compaction  problem  bears  a  lot  of  similarity  with  job  scheduling  and  is  likewise  NP-complete. 
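
The two feasibility conditions of the definition above (dependence and resource constraints) can be checked mechanically for any candidate schedule. The following is a minimal sketch, not part of the original formulation: the function names, the dictionary-based encoding of $R_O$ and $R_M$, and the two-operation example are all assumptions made for illustration. Minimality is not checked, since that is the NP-complete part of the problem.

```python
def schedule_length(sigma, dur):
    """len(sigma) = max over operations of sigma(op) + d(op)."""
    return max(sigma[op] + dur[op] for op in sigma)


def is_feasible(sigma, dur, usage, delay, capacity):
    """Check the dependence and resource constraints of the LCP.

    sigma:    dict op -> start cycle
    dur:      dict op -> duration d(op) in cycles
    usage:    dict op -> list of d(op) dicts; usage[op][x] maps a resource
              name to the units needed in step x (plays the role of R_O)
    delay:    dict (op1, op2) -> delta for each edge of the DDD
    capacity: dict resource -> units available (plays the role of R_M)
    """
    # Dependence constraints: sigma(op2) - sigma(op1) >= delta(e).
    for (op1, op2), delta in delay.items():
        if sigma[op2] - sigma[op1] < delta:
            return False
    # Resource constraints: at every cycle t, summed usage must fit R_M.
    for t in range(schedule_length(sigma, dur)):
        used = {}
        for op, start in sigma.items():
            x = t - start
            if 0 <= x < dur[op]:  # outside this range R_O is the null vector
                for res, units in usage[op][x].items():
                    used[res] = used.get(res, 0) + units
        if any(units > capacity.get(res, 0) for res, units in used.items()):
            return False
    return True


# Hypothetical block: two operations sharing one non-pipelined 2-cycle adder.
dur = {"a": 2, "b": 2}
usage = {"a": [{"add": 1}, {"add": 1}], "b": [{"add": 1}, {"add": 1}]}
capacity = {"add": 1}
print(is_feasible({"a": 0, "b": 0}, dur, usage, {}, capacity))  # False
print(is_feasible({"a": 0, "b": 2}, dur, usage, {}, capacity))  # True
```

The first schedule is rejected because both operations need the single adder in cycle 0; the second serializes them and is accepted.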

3.2      List Scheduling

List scheduling evolved from research in job scheduling theory [Coffman76] and is one of the methods
used to perform local compaction. In spite of the theoretical complexity of LCP, extensive experiments
have demonstrated that list scheduling is an effective heuristic to obtain near optimal solutions for LCP
[Fisher79,Fisher81a,Fisher81b].

List scheduling generates VLIW instructions by performing a topological sort of the DDD. Machine
resources like registers and busses are usually assigned on the fly while traversing the DDD. Functional units
may, however, be assigned by a preliminary pass. The functional unit assignment pass plays the same role as
register allocation for conventional compilers, and uses the same kind of algorithms. Preliminary functional
unit assignment is done when the target machine is not "symmetric", i.e. when functional units performing
the same operation do not have the same communication capabilities with other machine resources. When
the target machine is "symmetric", functional units are assigned on the fly, like any other machine resource.
What differentiates the two versions of list scheduling is the way instructions are generated. In both versions
the DDD is traversed in breadth first topological order.

Functional units assigned by a preliminary pass. Instructions are generated one at a time, in sequential
order: first the instruction executed at cycle 0, then the one executed at cycle 1, etc. To form
the next instruction, list scheduling considers all data ready nodes. A node is data ready if all the
operations on which it depends (i.e. its predecessors in the DDD) have already been scheduled. The
list scheduler fills the current VLIW instruction with as many data ready operations as resource
constraints permit. When no more data ready nodes can be squeezed in, the instruction is emitted, and
the packing of a new instruction is started.

Functional units assigned on the fly. Several instructions are generated at the same time. When an
operation op is selected from the data ready list, the earliest cycle (Min_Cycle) in which op can be
scheduled is computed. Min_Cycle is determined by the data precedence constraints imposed by the
already scheduled operations. Based on the resource constraints imposed by the already scheduled
operations, we determine Ok_Cycle (Min_Cycle ≤ Ok_Cycle), the earliest cycle in which op is actually
scheduled. If all operations take unit time, or machine resources are infinite, this view of list scheduling
is identical to the previous one, that is, instructions are created sequentially, and not simultaneously.

When  packing  operations  into  VLIW  instructions,  the  code  generator  is  often  faced  with  a  choice  of 
several  data  ready  nodes.  A  commonly  used  heuristic  is  the  critical  path  heuristic,  where  operations  which 
lie  on  the  longest  path  (in  terms  of  execution  time)  are  given  priority. 

Let Succ(op) be the set of operations which depend on op in the DDD (i.e. the successors of op in the
DDD), and Pred(op) be the set of operations on which op depends (i.e. the predecessors of op in the DDD).
Let Priority(op) be the priority of op. Here is the list scheduling algorithm for a "symmetrical" VLIW
machine (functional units are assigned on the fly):

FUNCTION List_Schedule (DDD = (O, E)) RETURN a schedule $\sigma$ IS
BEGIN
   Data_Ready := $\{op \in O \mid Pred(op) = \emptyset\}$;
   Scheduled := $\emptyset$;
   WHILE Data_Ready $\neq \emptyset$ LOOP
      let $op_{hp}$ be the operation with highest Priority in Data_Ready;
      Min_Cycle := $\max_{op \in Scheduled} (\sigma(op) + \delta(op, op_{hp}))$;
      let Ok_Cycle $\ge$ Min_Cycle be the earliest cycle such that:
         $\forall\, 0 \le t < d(op_{hp}): \; R_O(op_{hp}, t) + \sum_{op_s \in Scheduled} R_O(op_s, (Ok\_Cycle - \sigma(op_s)) + t) \preceq R_M$;
      $\sigma(op_{hp})$ := Ok_Cycle;
      Scheduled := Scheduled $\cup \{op_{hp}\}$;
      Data_Ready := (Data_Ready $- \{op_{hp}\}$) $\cup \{op_{succ} \in Succ(op_{hp}) \mid Pred(op_{succ}) \subseteq Scheduled\}$;
   END LOOP;
   RETURN $\sigma$;
END List_Schedule;

A simple implementation of the algorithm runs in $O(n^2)$ plus the time to compute the Priority function for
the n operations $\{op_1, \ldots, op_n\}$.

Let us now consider the following example. Assume an (unrealistic) machine with one floating point
adder, one multiplier and one square-root/divider unit. All these units are pipelined, and take respectively two, two
and three cycles to complete an operation. We assume memory and register access time to be null. Consider
the following straight line code which computes the roots of a second degree polynomial (we assume D >
0). Its intermediate representation is given to its right. The DDD is given in figure 4(a). Nodes containing
variables have been duplicated for convenience.


-- assume D > 0
D  := b*b - 4.0*a*c;                 t1 := b*b
x1 := (-b-sqrt(D)) / (2.0*a);        t2 := a*c
x2 := (-b+sqrt(D)) / (2.0*a);        t3 := 4.0*t2
                                     D  := t1-t3
                                     t4 := sqrt(D)
                                     t5 := -b
                                     t6 := t5-t4
                                     t7 := 2.0*a
                                     x1 := t6/t7
                                     t8 := t5+t4
                                     x2 := t8/t7


If we use the critical path heuristic for the Priority function, the schedule produced by list scheduling is
given in figure 4(b).
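
As an illustration, the on-the-fly variant of list scheduling with a critical path priority can be sketched in a few lines of Python and run on the intermediate code of the example. This is a simplified sketch under stated assumptions, not the algorithm as implemented in any compiler: each operation is assumed to occupy its (pipelined) functional unit only in its issue cycle, the delay on an edge is taken to be the duration of the producing operation, the negation `t5` is assumed to execute on the adder, and ties are broken alphabetically. All identifiers are hypothetical.

```python
from collections import defaultdict


def list_schedule(ops, edges, dur, unit, capacity):
    """edges: dict (pred, succ) -> delay; unit[op]: functional unit used in
    the issue cycle; capacity[u]: number of units of kind u.
    Returns a dict op -> start cycle (cycles counted from 0)."""
    pred, succ = defaultdict(set), defaultdict(set)
    for p, s in edges:
        pred[s].add(p)
        succ[p].add(s)

    # Critical path priority: longest path (in cycles) from op to a sink.
    prio = {}
    def longest(op):
        if op not in prio:
            prio[op] = dur[op] + max((longest(s) for s in succ[op]), default=0)
        return prio[op]
    for op in ops:
        longest(op)

    busy = defaultdict(int)          # (unit, cycle) -> units already in use
    sigma = {}
    ready = {op for op in ops if not pred[op]}
    while ready:
        op = max(sorted(ready), key=lambda o: prio[o])
        # Min_Cycle: imposed by the already scheduled predecessors.
        cycle = max((sigma[p] + edges[(p, op)] for p in pred[op]), default=0)
        # Ok_Cycle: first cycle >= Min_Cycle with a free functional unit.
        while busy[(unit[op], cycle)] >= capacity[unit[op]]:
            cycle += 1
        sigma[op] = cycle
        busy[(unit[op], cycle)] += 1
        ready.remove(op)
        ready |= {s for s in succ[op] if all(p in sigma for p in pred[s])}
    return sigma


# The roots example: durations and unit assignment as assumed above.
dur = {"t1": 2, "t2": 2, "t3": 2, "D": 2, "t4": 3, "t5": 2,
       "t6": 2, "t7": 2, "x1": 3, "t8": 2, "x2": 3}
unit = {"t1": "mul", "t2": "mul", "t3": "mul", "t7": "mul",
        "D": "add", "t5": "add", "t6": "add", "t8": "add",
        "t4": "div", "x1": "div", "x2": "div"}
deps = [("t2", "t3"), ("t1", "D"), ("t3", "D"), ("D", "t4"),
        ("t5", "t6"), ("t4", "t6"), ("t6", "x1"), ("t7", "x1"),
        ("t5", "t8"), ("t4", "t8"), ("t8", "x2"), ("t7", "x2")]
edges = {(p, s): dur[p] for p, s in deps}
sigma = list_schedule(list(dur), edges, dur, unit,
                      {"mul": 1, "add": 1, "div": 1})
print(sorted(sigma.items(), key=lambda kv: kv[1]))
```

Under these assumptions the multiplier is occupied by `t2`, `t1` and `t3` in the first three cycles, so `t7` is issued in the fourth cycle even though its result is not needed until the first division.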

Figure 4: (a) The data dependence dag. Nodes containing variables are duplicated for convenience. (b)
Schedule obtained by applying list scheduling.

The previous algorithm schedules operations as early as possible, regardless of the cycle in which an
operation result is used. This increases variable lifetime (the time-span between the definition and last use of
a variable), and consequently the time-span during which a variable has to be kept in a register. To reduce
variable lifetime a bottom-up approach can be used, as suggested by Lam [Lam87] and other researchers
[MuellerSG]. DDD edges are reversed, and list scheduling is applied to this modified DDD. The schedule
obtained will itself have to be reversed to obtain a valid schedule. By using reverse topological ordering,
a store into a register is scheduled after all its uses have been scheduled. By placing the store as close to its uses as
data and resource constraints permit, the lifetime of a variable is shortened. If bottom-up list scheduling is
applied to the previous example, t7:=2.0*a will be scheduled in cycle 10 instead of 4.
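
The bottom-up variant can be sketched by running the same greedy scheduler on the reversed DDD and mirroring the result. The code below is an illustrative sketch only. It assumes one pipelined unit of each kind that is occupied in the issue cycle, a delay equal to the producer's duration, the negation `t5` on the adder, and alphabetical tie-breaking; the helper names and the `slot` parameter are hypothetical. In reversed time an operation occupies its unit in its last step, which is what `slot` expresses.

```python
from collections import defaultdict


def list_schedule(ops, edges, dur, unit, slot):
    """Greedy list scheduling with a critical path priority.

    edges: dict (pred, succ) -> delay; unit[op]: functional unit used;
    slot[op]: step of op (0 .. d(op)-1) in which it occupies its unit.
    One unit of each kind is assumed."""
    pred, succ = defaultdict(set), defaultdict(set)
    for p, s in edges:
        pred[s].add(p)
        succ[p].add(s)
    prio = {}
    def longest(op):
        if op not in prio:
            prio[op] = dur[op] + max((longest(s) for s in succ[op]), default=0)
        return prio[op]
    for op in ops:
        longest(op)
    busy, sigma = set(), {}
    ready = {op for op in ops if not pred[op]}
    while ready:
        op = max(sorted(ready), key=lambda o: prio[o])
        cycle = max((sigma[p] + edges[(p, op)] for p in pred[op]), default=0)
        while (unit[op], cycle + slot[op]) in busy:
            cycle += 1
        sigma[op] = cycle
        busy.add((unit[op], cycle + slot[op]))
        ready.remove(op)
        ready |= {s for s in succ[op] if all(p in sigma for p in pred[s])}
    return sigma


def top_down(ops, deps, dur, unit):
    edges = {(p, s): dur[p] for p, s in deps}
    return list_schedule(ops, edges, dur, unit, {op: 0 for op in ops})


def bottom_up(ops, deps, dur, unit):
    # Reverse every edge; in reversed time the consumer runs first, so the
    # delay on the reversed edge (s, p) is the duration of s.
    rev_edges = {(s, p): dur[s] for p, s in deps}
    tau = list_schedule(ops, rev_edges, dur, unit,
                        {op: dur[op] - 1 for op in ops})
    length = max(tau[op] + dur[op] for op in ops)
    # Mirror the reversed schedule to obtain forward start cycles.
    return {op: length - tau[op] - dur[op] for op in ops}


dur = {"t1": 2, "t2": 2, "t3": 2, "D": 2, "t4": 3, "t5": 2,
       "t6": 2, "t7": 2, "x1": 3, "t8": 2, "x2": 3}
unit = {"t1": "mul", "t2": "mul", "t3": "mul", "t7": "mul",
        "D": "add", "t5": "add", "t6": "add", "t8": "add",
        "t4": "div", "x1": "div", "x2": "div"}
deps = [("t2", "t3"), ("t1", "D"), ("t3", "D"), ("D", "t4"),
        ("t5", "t6"), ("t4", "t6"), ("t6", "x1"), ("t7", "x1"),
        ("t5", "t8"), ("t4", "t8"), ("t8", "x2"), ("t7", "x2")]
early = top_down(list(dur), deps, dur, unit)
late = bottom_up(list(dur), deps, dur, unit)
print(early["t7"], late["t7"])  # 3 9
```

Counting cycles from 0, `t7` moves from cycle 3 to cycle 9 while the schedule length stays at 15 cycles. If cycles are counted from 1, this corresponds to the move from cycle 4 to cycle 10 mentioned in the text.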

4      Global Compaction Techniques

This section focuses on three important global compaction techniques: trace scheduling [Fisher84b,Ellis86],
percolation scheduling [Nicolau85a,Nicolau85b,Aiken88a] and region scheduling [Gupta87c,Gupta88].

As  previously  mentioned,  the  necessity  of  such  global  compaction  techniques  stemmed  from  experiments 
performed  in  the  early  70s.  The  experiments  (section  4.1)  showed  that  very  little  fine  grain  parallelism  is 
available  within  basic  blocks,  but  massive  amounts  of  such  parallelism  can  be  uncovered  if  one  disregards 
the  basic  block  barrier. 

Trace  scheduling  (section  4.2)  was  one  of  the  first  techniques  that  went  beyond  basic  blocks  in  its  search 
for  fine  grain  parallelism.  This  technique  employs  branching  probabilities  to  make  educated  guesses  on  the 
run  time  behavior  of  a  program. 

Percolation scheduling (section 4.4) is a simplification and generalization of the ideas behind trace scheduling.
A program is compacted without the need for run time information by repeatedly applying a series of
elementary program transformations.

Region scheduling (section 4.5) redistributes the parallelism available throughout a program, taking into
account machine capabilities. As in percolation scheduling, repetitive application of elementary program
transformations is employed. However, while percolation scheduling transformations operate on the flow
graph of a program, region scheduling employs a new intermediate program representation paradigm: the
program dependence graph [Ferrante87].

4.1      Amount  of  Fine  Grain  Parallelism  Available  in  Serial  Programs 

Experiments performed in the early 70s [Riseman72,Foster72,Tjaden70] looked at the amount of fine grain
parallelism that was available in serial programs. The experimenters examined machine language streams
of IBM and CDC machines, broke these streams into basic blocks, and determined through simulation the
speedup that one could obtain by using infinite hardware. Memory access time was assumed to be null and
renaming was used to avoid unnecessary contention problems. The primary concern of the experiments was
the manner in which conditional instructions were treated.

Blocking on conditionals: the first set of experiments explored the parallelism available within basic
blocks, that is, upon encountering a conditional branch the infinite machine would wait until the
condition resolved before scheduling any other operation. The speedup obtained was on the order of
2. This is insufficient to keep a highly parallel architecture busy.

Bypassing conditionals: the second set of experiments assumed the path taken at each branch to be
known in advance. Speedups ranging from 8 to over 100, with an average of about 50, were observed.
The experiments also considered bypassing a fixed number of conditionals at any given time. The
relative speedup increased as $\sqrt{j}$, where j is the number of conditionals bypassed. This relation holds
well for j < 32 [Riseman72]. Foster and Riseman give no theoretical justification for this fact.

Another set of experiments determined the amount of parallelism that could be uncovered if conditional
jumps are ignored, but the search for parallelism is limited to a fixed window of instructions. [Riseman72]
found that for a window of size 64, only one fourth of the potential speedup could be obtained. This implies
that instructions must be extensively rearranged in order to achieve maximum speedup.

The programs tested were symbolic as well as numerical. It is interesting to remark that experimenters
found no significant difference in results between these two kinds of programs [Riseman72]. This would tend to
support the use of VLIWs to speed up non-numerical code [Ebcioglu88].

Nicolau and Fisher [Nicolau81,Nicolau84b] did similar experiments, focusing on numerical programs.
They also assumed a machine with infinite hardware. An oracle was used to compact code for execution on
the infinite machine. The role of the oracle was twofold: it had to predict the outcome of conditional jumps
and to resolve all ambiguous memory references. Extensive renaming was used. The execution time of the
oracle compacted program on the infinite machine is just the length of the longest data flow dependence
chain in the program. The speedups observed averaged 90.


4.2      Trace  Scheduling 

4.2.1      The  Trace  Scheduling  Algorithm 

One of the first techniques to exploit fine grain parallelism across basic blocks was trace scheduling (TS)
[Fisher81a,Fisher81b]. The experiments Nicolau and Fisher performed were carried out with trace scheduling
specifically in mind. The primary goal of TS was to replace the experiments' oracle with some sort of
compile time predictor. The compile time predictor was obtained by assigning branching probabilities to
conditional jumps. By using these branching probabilities, TS can select the most likely execution path of
a program (such a path is called a trace). Once a trace has been selected, it is compacted as if it were a
big basic block. The compaction phase may move operations above or below conditional jumps. To preserve
semantic equivalence with the original program, TS has to insert bookkeeping code at the points where the
trace interfaces with the rest of the program (these connection points are called joins and splits¹). After the
most probable trace has been compacted the next most probable trace is selected. The process repeats until
only the most improbable traces remain. The remaining blocks are compacted using some local compaction
algorithm. See Figure 5 for an example. Dashed edges in the DDD are special edges and are similar in nature
to control dependence edges (section 2.2). TS has to introduce such edges to prevent illegal code motions
(in this case to prevent the square-rooting of a negative number, and division by zero). Operations in a single
horizontal box can be executed concurrently. Note that in this particular example the bookkeeping nodes
are not needed, but there is no way for TS to tell, without extensive analysis. We now describe the three
phases of TS in more detail [Fisher84a,Fisher84b,Ellis86].

Trace selection. In order to select traces one must gather probabilities about the branching behavior
of a program. TS relies on programmer supplied hints, and loop nesting. The programmer can also use
an automatic profiler. By using branching probabilities and loop bounds TS selects a seed operation with
highest execution probability. The trace is grown backward and forward from the seed. To grow forward, the trace
picker looks at $op_{last}$, the operation currently at the end of the trace. If $op_{last}$ is not a conditional branch the
choice of the next operation is obvious. If $op_{last}$ is a conditional branch then we select the operation which
is executed by following the branch with highest probability. Once the successor operation has been selected
it is added to the end of the trace and the process is repeated. A trace is grown backward in an analogous

¹ A split is a conditional jump out of the trace, a join is a conditional jump into the trace.

PROCEDURE Roots (a, b, c: IN Float;
                 x1, x2: OUT Float;
                 Num_Roots: OUT Integer;
                 Ok: OUT Boolean) IS
   D: Float;
BEGIN
   IF (a = 0) THEN
      Ok := False; RETURN;
   END IF;
   D := b * b - 4.0 * a * c;
   IF (D < 0.0) THEN
      Ok := False;
   ELSE
      Ok := True;
      IF (D = 0.0) THEN
         Num_Roots := 1;
      ELSE
         Num_Roots := 2;
      END IF;
      x1 := (-b-sqrt(D))/(2.0 * a);
      x2 := (-b+sqrt(D))/(2.0 * a);
   END IF;
END Roots;

Figure 5: (a) Initial program. (b) Program flow graph. The trace is composed of the operations connected
by solid lines. (c) The data dependence dag of the trace. (d) The program flow graph after compaction and
bookkeeping.

fashion. A trace stops growing in one of the two directions when the operation chosen has already been
selected by a previously formed trace. Also, a trace is not allowed to cross loop boundaries (this restriction
was imposed to make memory reference disambiguation simpler). To increase the parallelism of loop body
traces, the compiler unrolls inner loops.
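
The trace picking procedure can be sketched as follows. This is a simplified illustration under assumptions: execution frequencies and branch probabilities are supplied directly as dictionaries, backward growth follows the incoming edge with the highest estimated frequency, and the loop boundary restriction is not modeled. The function name, node names and the probabilities attached to the Roots flow graph are all hypothetical.

```python
def pick_traces(freq, succ):
    """freq: dict op -> estimated execution frequency.
    succ: dict op -> list of (successor, branch probability).
    Returns a list of traces, most probable seed first."""
    pred = {}
    for op, outs in succ.items():
        for nxt, prob in outs:
            pred.setdefault(nxt, []).append((op, freq[op] * prob))
    taken, traces = set(), []
    for seed in sorted(freq, key=freq.get, reverse=True):
        if seed in taken:
            continue
        trace = [seed]
        taken.add(seed)
        # Grow forward along the most probable outgoing edge.
        while succ.get(trace[-1]):
            nxt = max(succ[trace[-1]], key=lambda e: e[1])[0]
            if nxt in taken:          # already part of a trace: stop
                break
            trace.append(nxt)
            taken.add(nxt)
        # Grow backward along the most frequent incoming edge.
        while pred.get(trace[0]):
            prv = max(pred[trace[0]], key=lambda e: e[1])[0]
            if prv in taken:
                break
            trace.insert(0, prv)
            taken.add(prv)
        traces.append(trace)
    return traces


# Flow graph of the Roots procedure with made-up probabilities.
succ = {
    "if_a0": [("ret", 0.01), ("D", 0.99)],
    "D": [("if_Dneg", 1.0)],
    "if_Dneg": [("ok_false", 0.1), ("ok_true", 0.9)],
    "ok_true": [("if_D0", 1.0)],
    "if_D0": [("one_root", 0.05), ("two_roots", 0.95)],
    "one_root": [("x1x2", 1.0)],
    "two_roots": [("x1x2", 1.0)],
}
freq = {"if_a0": 1.0, "ret": 0.01, "D": 0.99, "if_Dneg": 0.99,
        "ok_false": 0.09, "ok_true": 0.90, "if_D0": 0.90,
        "one_root": 0.05, "two_roots": 0.85, "x1x2": 0.90}
traces = pick_traces(freq, succ)
print(traces[0])
```

With these numbers the first trace runs from the test on `a` through the computation of both roots, and the three unlikely blocks each end up in a trace of their own.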

Trace Compaction. After a trace has been selected it is handed to the code generator, which compacts it
using list scheduling. Because traces span several basic blocks, special edges (control dependence-like
edges) have to be added to the trace DDD to prevent illegal code motions (note that the CDG had not yet been
introduced at the time research on trace scheduling was performed). Let cj and op be respectively a conditional
jump and an operation. If both cj and op lie on the trace and op clobbers a variable which is live on
the off-trace branch of cj, a special edge is inserted in the DDD from cj to op.
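
The insertion rule for special edges can be sketched as a small function. This is only an illustration: it assumes that the set of variables written by each operation and the set of variables live on each off-trace branch are already known, and it considers only operations that follow the conditional jump on the trace (the text does not state this restriction explicitly). All names are hypothetical.

```python
def special_edges(trace, writes, live_off_trace):
    """trace: list of operations in trace order.
    writes: dict op -> set of variables the operation clobbers.
    live_off_trace: dict cj -> set of variables live on the off-trace
    branch of conditional jump cj (only conditional jumps appear as keys).
    Returns the list of special edges (cj, op) to add to the trace DDD."""
    edges = []
    for i, cj in enumerate(trace):
        live = live_off_trace.get(cj)
        if live is None:              # not a conditional jump
            continue
        for op in trace[i + 1:]:      # assumption: only ops after cj
            if writes.get(op, set()) & live:
                edges.append((cj, op))
    return edges


# Hypothetical trace: op1 overwrites x, which the off-trace branch still reads.
trace = ["cj1", "op1", "op2"]
writes = {"op1": {"x"}, "op2": {"y"}}
live_off_trace = {"cj1": {"x"}}
print(special_edges(trace, writes, live_off_trace))  # [('cj1', 'op1')]
```

Here `op2` remains free to move above `cj1`, while `op1` is pinned below it by the added edge.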

After  the  trace  has  been  compacted  and  VLIW  instructions  generated,  the  code  generator  deletes  the 
trace  from  the  program  flow  graph  and  replaces  it  with  position  information.  Position  information  gives 
the  register  or  memory  locations  where  variables  live  at  the  beginning  or  end  of  the  trace  are  stored.  This 
information  is  needed  to  interface  traces. 

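Bookkeeping. After compacting a trace, the bookkeeping phase has to ensure that semantic equivalence
with the initial program is preserved. If an operation op originally above a split is moved below it by the
code generator, op has to be copied onto the split's off-trace branch, since the off-trace path would otherwise
no longer execute it. (Moving an operation from below a split to above it needs no copy; the special edges
introduced during trace compaction keep such a move from clobbering a variable that is live off the trace.)
The case of a join is similar: a join is placed
as early as possible in the compacted code, such that no operation originally above it is now below it. All
operations that have moved above joins have to be copied.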
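
The join rule can be illustrated with a short sketch. It treats the compacted trace as a linear sequence of operations rather than as VLIW instructions, which is a simplifying assumption, and the function and parameter names are hypothetical. The join is placed just after the last operation that was originally above it, and every operation that sits before that point but originally came from below the join must be copied for the incoming path.

```python
def place_join(original, scheduled, join_index):
    """original: operations in original trace order.
    scheduled: the same operations in compacted order.
    join_index: the join originally entered the trace just before
                original[join_index].
    Returns (new_index, copies): the join now enters just before
    scheduled[new_index], and `copies` lists the operations that moved
    above the join and must be duplicated on the joining path."""
    above = set(original[:join_index])
    pos = {op: i for i, op in enumerate(scheduled)}
    # Earliest point with no originally-above operation below it.
    new_index = 1 + max((pos[op] for op in above), default=-1)
    copies = [op for op in scheduled[:new_index] if op not in above]
    return new_index, copies


# Hypothetical trace a, b, c, d with a join entering before c;
# compaction hoisted c above b.
print(place_join(["a", "b", "c", "d"], ["a", "c", "b", "d"], 2))  # (3, ['c'])
```

In the example the path entering through the join must execute a copy of `c` before resuming the trace at `d`.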

To  copy  a  jump  into  a  split  one  has  to  consider  the  off-trace  and  the  on-trace  edges  of  the  conditional 
jump.  When  the  conditional  jump  is  copied  the  off-trace  edge  of  the  copy  is  left  unchanged  while  the  on-trace 
edge  points  to  the  next  operation  in  the  split.  Things  get  a  little  more  complicated  when  jumps  have  to  be 
copied  up  into  joins.  Consider  the  operations  on  the  trace  that  are  below  the  join  and  above  the  jump,  but 
that  are  not  copied  into  the  join  because  they  are  scheduled  below  it.  These  operations  have  to  be  copied 
onto  the  off-trace  edge  of  the  jump's  new  copy.  Note  that  this  rule  is  not  applied  recursively:  if  the  operations 
contain  conditional  branches,  they  are  just  copied  like  any  other  operation. 

4.2.2      VLIW  Machines  Using  Trace  Scheduling 

The Bulldog (Fortran) compiler designed at Yale implemented TS [Ellis86,Fisher84b]. Its target machine
was the ELI (Enormously Longword Instructions) [Fisher83]. This machine was never built and remained
a paper design. However, some of the people working on the VLIW project at Yale co-founded a company
(Multiflow Computers, Inc.) to produce a commercial processor (the TRACE(TM) series [Colwell88]) using
VLIW and trace scheduling technology. Both the ELI and the TRACE have been designed for scientific
computing. We first describe ELI and then its industrial evolution.

ELI's machine resources are divided among 8 identical clusters which are connected by long slow busses
(figure 6(a) shows the cluster interconnections). Cluster resources are connected via a partial crossbar; they
comprise (see figure 6(b)):

• an integer ALU, a pipelined floating adder and a pipelined floating multiplier. The inputs of each functional
unit are connected to two register files of 16 registers each. Division is not directly implemented;
a Newton's method approximation is used (an 8 bit inverse is obtained via table look-up);

• a global bus unit that connects the cluster to each of its four neighbors;

• one of 8 dual ported data memory banks. To match processor and memory speeds ELI provides one
dual ported data memory bank per cluster. The first port (front door) provides direct access to memory
while the second port (back door) is connected to a central memory controller (located in some of the
clusters) which resolves (at execution time) memory references whose bank could not be determined at
compile time. A memory bank front door is connected to the cluster partial crossbar. The front door
contains a limited integer ALU for address arithmetic, as well as two sets of index registers (not shown
in figure 6).

Figure 6: (a) Cluster interconnections of the ELI. (b) The structure of a cluster.

Loop  unrolling  is  one  of  the  techniques  the  Bulldog  compiler  uses  for  memory  bank  disambiguation.  A 
preloop  fixes  at  run  time  the  initial  memory  bank  of  a  series  of  vector  references.  The  remaining  iterations 
are  executed  in  a  second  loop  where  unrolling  ensures  that  vector  references  are  bank  disambiguated. 

The ELI has an instruction word of over 500 bits. Each cluster's activity is controlled by a piece of this
very long instruction.

As demonstrated by the experiments of section 4.1, basic blocks do not contain enough fine grain parallelism.
This implies that a high fraction of program operations are tests. Packing a single conditional jump
per VLIW instruction will therefore prevent significant speedups, as a computation's running time is bounded below
by the number of conditionals that have to be executed at run time. Part of ELI's very long instruction word
is therefore devoted to conditional jumps. Up to k ($k \in \{3,4,5\}$) conditional branches can be packed per
VLIW instruction. Semantically the conditional part of an instruction executes like a lisp cond statement.
The k conditional tests are evaluated one after the other. The first test which is true causes a jump to its
corresponding branching address. If none of the tests is true control falls through to the next instruction. Of
course the hardware implementation evaluates all the k conditions in parallel and uses a priority encoder
to select the first one that is true. Note that this branching mechanism is closely related to the needs of TS.
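
The semantics of such a multiway branch can be illustrated with a small sketch. Both functions below are hypothetical models written for this purpose, not a description of the ELI hardware: the first follows the sequential, cond-like definition, and the second mimics evaluating all tests at once and selecting the first true one as a priority encoder would.

```python
from itertools import product


def multiway_branch_sequential(tests, targets, fall_through):
    """cond-like semantics: the first true test selects its target."""
    for test, target in zip(tests, targets):
        if test:
            return target
    return fall_through


def multiway_branch_parallel(tests, targets, fall_through):
    """All tests are available at once; a priority encoder picks the
    lowest-numbered true test."""
    bits = [bool(t) for t in tests]
    return targets[bits.index(True)] if any(bits) else fall_through


# Exhaustive comparison for k = 3 tests with made-up branch targets.
targets = ["L1", "L2", "L3"]
for tests in product([False, True], repeat=3):
    assert (multiway_branch_sequential(tests, targets, "next")
            == multiway_branch_parallel(tests, targets, "next"))
print(multiway_branch_parallel((False, True, True), targets, "next"))  # L2
```

The exhaustive loop checks that the two views agree for every combination of test outcomes.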

The results obtained for the Bulldog-ELI symbiosis are encouraging. An ideal ELI with an infinite number
of clusters, where operations take a cycle and intra-cluster communication delays are nil, can achieve
speedups ranging from 3 (for sequential applications) to 50 (averaging 17)*. Most of the applications tested
were scientific routines. The results obtained for the (realistic) ELI previously described are speedups up to
8 (which is the maximum speedup since there are 8 clusters), with an average of 5.

While the ELI was originally designed to be a back end processor, the TRACE(TM) series built at
Multiflow [Colwell88] is a fully fledged multiuser machine supporting virtual memory and running UNIX.
The machine has three hardware configurations, capable of executing 7, 14 or 28 instructions simultaneously.
The 14 instruction machine yields 4 to 10 times the performance of a DEC 8700 at twice the cost, and from
1/2 to 1/5 the performance of a Cray-XMP that costs over 20 times as much (these benchmarks are based
on Linpack and Livermore Loops). The goals in the design of the TRACE series were to use high volume
low cost electronics, use DRAMs for main memory, support IEEE 64-bit standard floating point computations,
and perform well in a multiuser environment.

The computing power of the machine is provided through four integer and floating point boards (respectively
called I and F boards, see figure 7 for details). An I-F board pair corresponds to an ELI cluster. A
machine is sold with one, two or four I-F boards, corresponding to a 256, 512 or 1024 bit instruction word.
The 1024-bit configuration initiates 28 operations and 4 ELI-like conditional tests (one per I board) for each
instruction. The machine peak performance is 60 MFLOPs.

In  order  to  support  virtual  memory,  every  machine  operation  carries  along  its  destination  (register  or 
memory)  so  that  when  an  interrupt  occurs  the  system  can  complete  the  current  operations  and  save  the 
computed  results,  before  responding  to  the  interrupt. 

In  the  fully  configured  machine  four  memory  references  (one  per  I  board)  can  be  started  every  65  ns. 
As  no  data  cache  is  present,  memory  bandwidth  is  provided  through  eight  memory  controllers  each  one 
controlling  up  to  eight  independent  banks.  Memory  addresses  are  fully  interleaved.  Figure  8  shows  the  top 
level  architecture  of  a  fully  configured  TRACE. 

The TRACE has a physically distributed instruction cache. There is 1 Mbyte of cache in a fully configured
machine. Cache, as well as TLBs, are process tagged, so that no flushing is necessary at context switches.
Instruction caches output a fixed length 1024-bit instruction per clock cycle (in a fully configured machine).
The bits output by the caches are hardwired to the functional units that they control. Instructions which
do not involve floating point computations contain a number of no-ops. To minimize the size of the object
code stored in primary or secondary storage a variable length representation of the fixed length machine
instructions is used.

TRACE runs 4.3 BSD UNIX. UNIX C code has been compiled using a trace scheduling compiler, and
has produced good VLIW code, something the Multiflow team did not expect. The size of the UNIX code
on the TRACE is a factor of 3 bigger than on a VAX (this growth in size is due to bookkeeping).

*The speedup ratios are all measured against the 1 cluster ELI configuration.

4.2.3      Problems  of  Trace  Scheduling 

TS has various problems, one of the biggest being code explosion. The patching code generated during
bookkeeping can potentially be exponential in the size of the input program [Ellis86]. One might indeed
wonder whether TS actually terminates, as the generation of bookkeeping code might constantly increase the
size of the code still to trace schedule. [Nicolau84a] proved that TS does terminate. In practice bookkeeping
does indeed increase code size by a substantial amount. If we add in the amount of extra code generated by
loop unrolling, code size increase might become problematic. In the experiments that Ellis performed code
size increased by a factor of 6 [Ellis86]. The following program fragment illustrates code explosion:

FOR i IN 1..n LOOP
   IF abs(a(i)) X THEN
      c(i) := b(i)/a(i);
   ELSE
      c(i) := c_max;
      a(i+1) := 2 * a(i+1);
   END IF;
END LOOP;

Assuming a 64-bank VLIW machine, the loop body might be unrolled as much as 64 times. The flow
graph that we would obtain would then look like figure 9(a). Assuming that the most probable trace is
IF c1; o1; IF c2; o2; ... ; IF cn; on, figure 9(b) shows how the code generator might possibly reorder trace
operations. Note that although this reordering would normally be illegal for the above loop, provisions for
this kind of code motion are made in the TRACE. Figure 9(c) gives the loop after bookkeeping.

Let n be the number of conditional branches in the initial loop body, and let f(n) be the code size of the
loop body after the entire loop has been trace scheduled. We have:

$$f(n) = 2n + (f(n-1) + 1) + \ldots + (f(1) + 1) + f(0)$$

where $f(0) = 1$. By differencing and substitution we obtain: $f(n) = 3 \cdot 2^n - 3$ ($n \ge 1$). Ellis [Ellis86] has
shown that an $O(n!)$ increase in code size is possible.
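
The closed form can be checked numerically against the recurrence. The short sketch below (the function name is an assumption) evaluates the recurrence directly and compares it with $3 \cdot 2^n - 3$. Subtracting the recurrence for $n-1$ from the one for $n$ gives $f(n) = 2 f(n-1) + 3$ for $n \ge 2$, which together with $f(1) = 3$ yields the same formula.

```python
def code_size(n):
    """Evaluate f(n) = 2n + (f(n-1)+1) + ... + (f(1)+1) + f(0), f(0) = 1."""
    f = [1]
    for m in range(1, n + 1):
        f.append(2 * m + sum(f[i] + 1 for i in range(1, m)) + f[0])
    return f[n]


# The recurrence and the closed form agree for the first few values.
for n in range(1, 21):
    assert code_size(n) == 3 * 2 ** n - 3
print([code_size(n) for n in range(1, 6)])  # [3, 9, 21, 45, 93]
```

The size roughly doubles with every additional conditional in the unrolled body, which is why a loop unrolled 64 times cannot be compacted this way in practice.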

Inferring a program's run time behavior is another problem of TS. While branching probabilities are
skewed for numerical code, 50-50 branching behavior is more likely to be observed for symbolic programs.
This restricts the applicability of TS to scientific code (although the Multiflow team employed TS for system
code). It is possible, however, to apply non trivial global code motions without any knowledge of a program's run
time behavior [Nicolau85a,Gupta87c,Gupta88]. This leads to a more structured global code motion strategy
which avoids some of the useless bookkeeping code introduced by TS (as in the example of figure 5).

The unsatisfactory treatment of loops is another major problem of TS. As traces are limited to loop
inner bodies, loop unrolling has to be employed to obtain sufficiently long traces. Since performance usually
improves with the amount of unrolling, speeding up loops conflicts with the need to prevent code explosion.
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Figure  9:    (a)  Example  of  a  flow  graph  which  produces  code  explosion,    (b)  Possible  reordering  of  trace 
operations  made  by  the  code  generator,  (c)  The  loop  after  bookkeeping. 
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4.3     Improvements  of  Trace  Scheduling 

In  this  section  we  survey  various  improvements  of  TS. 

•  Tree  compaction  [Lah83]  avoids  code  explosion,  as  well  as  TS  bookkeeping  complexity. 

•  SRDAG compaction [Linn83] is a generalization of TS. It is a first attempt to use more global
information in making compaction decisions.

Another improvement of TS is due to Su. Su [Su84] gives a method that provides an alternative treatment
of loops. His URCR (UnRoll Compact Reroll) method was developed in the context of microcode compaction.
The  idea  behind  URCR  has  been  subsequently  adapted  to  serial  programs  [Su86,Su87],  and  is  presented  in 
section  5. 

4.3.1      Tree  Compaction 

Tree compaction works like TS except that the scheduling range is limited. A program is partitioned into top
and bottom subtrees to which TS is separately applied. Operations are not allowed to be moved past the
root of the subtree they are in. This limits the number of operation copies that have to be made during
bookkeeping.



Figure  10:    (a)  A  program  flow  graph,    (b)  The  top  trees  of  the  flow  graph.    Top  trees  are  surrounded  by 
dotted  lines,  (c)  The  bottom  trees  of  the  flow  graph. 

The top trees of a flow graph are determined as follows (bottom trees are determined by using the same
algorithm on the reversed flow graph).

1.  Start at the entry block of the flow graph. Mark it as a top tree root.

2.  Once the root r of a top tree has been determined, perform a depth first search starting at r.

3.  Stop growing the tree in the current direction as soon as you hit a vertex v with out-degree(v) > 2 or
with in-degree(v) > 1; enqueue such a vertex v in a queue Q (if v is not already in Q).

4.  Once the current top tree has been fully grown, dequeue the first element of Q, mark the vertex as a
top tree root and repeat steps 2 and 3.
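
The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the flow graph is given as a successor map `succ` in which every block appears as a key, a vertex that stops growth is taken not to belong to the current tree (it becomes the root of a later tree), and the name `top_trees` is hypothetical.

```python
from collections import deque


def top_trees(succ, entry):
    """Partition a flow graph into top trees.

    succ maps every block to the list of its successors.
    Returns a dict mapping each top tree root to the set of blocks in its tree.
    """
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1

    def stops_growth(v):
        # Step 3: out-degree(v) > 2 or in-degree(v) > 1.
        return len(succ[v]) > 2 or indeg[v] > 1

    trees = {}
    assigned = set()          # blocks already placed in some tree
    queue = deque([entry])    # step 1: the entry block is the first root
    queued = {entry}          # every vertex ever enqueued as a root
    while queue:
        root = queue.popleft()            # step 4: next root
        tree = {root}
        assigned.add(root)
        stack = [root]
        while stack:                      # step 2: depth first search
            u = stack.pop()
            for v in succ[u]:
                if v in assigned or v in queued:
                    continue              # already a root or already placed
                if stops_growth(v):
                    queue.append(v)       # step 3: becomes a later root
                    queued.add(v)
                else:
                    tree.add(v)
                    assigned.add(v)
                    stack.append(v)
        trees[root] = tree
    return trees
```

For a diamond-shaped graph (A branches to B and C, which both join at D), the sketch yields one tree rooted at A containing A, B and C, and a second tree consisting of the join block D alone. Because a loop header has in-degree greater than one, it is always a root, so no tree contains a back edge.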

See figure 10 for an example. After the blocks of the flow graph have been partitioned into top trees, TS
is applied individually to each top tree. Let $V_{top}$ be the program obtained after all the top trees have been
compacted. The final code is obtained by building the flow graph of $V_{top}$, finding its bottom trees and applying
TS individually to each bottom tree. Note that, as in TS, traces never cross loop boundaries, since top and
bottom trees never contain back edges.

The worst case space requirements of the algorithm are quadratic in the number of conditional branches
present in the program [Lah83]. Experiments [Su85] performed in the realm of microcode programs showed
that tree compaction does indeed have better space requirements than TS, and that the space saved does not
entail a significant loss in execution speed. In addition, tree compaction executes faster than TS: list
scheduling and bookkeeping are cheaper because traces are shorter. These results do not carry
over directly to serial programs, but they nonetheless give some indication.

4.3.2      Singly  Rooted  Dag  Compaction 

The singly rooted dag (SRDAG) method is based on the observation that traces should be reselected after
the first block of the current trace has been compacted. The reason is that once a conditional branch has
been moved up and scheduled in the first block, the most likely trace in the program might have changed.
While TS limits its global view of a program to simple flow graph paths, the SRDAG algorithm considers
whole singly rooted dags, which gives it a more global view. For example, an
operation might be present at the top of several basic blocks in the SRDAG. The SRDAG algorithm prefers
to move such duplicated operations up, whereas TS is not aware of this possible improvement. The SRDAG
algorithm is iterative. An iteration comprises 3 steps:

1.  SRDAG selection. Use the program probability function to determine which uncompacted basic
block $B_{root}$ (with no uncompacted predecessors) has the highest execution probability. The SRDAG is the
maximal singly rooted dag, rooted at $B_{root}$, comprising only uncompacted blocks (note that SRDAGs
never cross loop boundaries).

2.  Compaction. Create an empty basic block $B_{compacted}$ above $B_{root}$. Use list scheduling to move
operations from an arbitrary block in the SRDAG to $B_{compacted}$. List scheduling is modified so that
multiple data ready copies of an operation are represented by a single instance of the operation (a list
of pointers is maintained to locate the instances of the operation). The priority of the representing
operation is the sum of the individual priorities. Being able to schedule this set of identical operations
all at once is the global-view advantage of the SRDAG algorithm over TS. The compaction step ends
as soon as a conditional branch is scheduled in $B_{compacted}$.



3.  Bookkeeping. As in TS, the compaction phase necessitates the introduction of bookkeeping code to
preserve semantic equivalence with the original program. The bookkeeping algorithm is similar to that
of TS, but is complicated by the fact that the SRDAG algorithm deals with dags instead
of paths. See figure 11 for an example.


Figure 11: (a) The original SRDAG; o is the operation to be scheduled in $B_{compacted}$. (b) The flow graph
after bookkeeping.
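
The modification to list scheduling described in the compaction step (one representative per set of identical data ready operations, with summed priority) might be sketched as follows. This is a hedged illustration only: identifying copies by their textual form, the tuple layout and the name `merge_ready_copies` are assumptions, not part of the SRDAG method as published.

```python
from collections import defaultdict


def merge_ready_copies(ready):
    """Merge identical data ready operations into single representatives.

    ready is a list of (block, operation_text, priority) triples.
    Returns a list of (operation_text, summed_priority, blocks) triples,
    highest summed priority first; `blocks` plays the role of the list of
    pointers to the individual instances.
    """
    merged = defaultdict(lambda: [0, []])
    for block, op, priority in ready:
        merged[op][0] += priority      # priority of the representative
        merged[op][1].append(block)    # remember where each copy lives
    result = [(op, total, blocks) for op, (total, blocks) in merged.items()]
    result.sort(key=lambda entry: -entry[1])
    return result
```

With two copies of the same operation of priorities 2 and 3, the representative gets priority 5 and can therefore be preferred over a single operation of priority 4 that TS, looking at one path only, would have scheduled first.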


The algorithm iterates until the whole program has been compacted (but no proof that the algorithm
terminates is provided). Linn conjectures that code compacted with the SRDAG method will never run more
than a factor of 2 faster than trace scheduled code.

As stated, the algorithm allows only a single conditional jump per (very long) instruction. This
restriction can be removed as follows. As soon as a conditional branch is moved into $B_{compacted}$, a
special mode is entered in which only conditional branches are allowed to move into $B_{compacted}$. The compaction
step ends when no more conditional jumps can be moved into $B_{compacted}$.

The  major  criticism  that  can  be  formulated  against  the  SRDAG  algorithm  is  having  to  consider  a  singly 
rooted  dag  for  every  block  that  is  compacted.  This  entails  a  serious  execution  overhead,  which  was  confirmed 
by  the  experiments  performed  by  Su  [Su85]  in  the  context  of  microprogramming. 

4.4  Percolation Scheduling

Percolation scheduling (PS) [Nicolau85a,Nicolau85b,Aiken88a] is a global compaction technique developed
by Nicolau as a result of his experience with trace scheduling [Nicolau84a]. PS is a simplification and
generalization of the ideas behind TS; its transformations have been formalized and proved correct [Nicolau85a].
PS is a hierarchy of semantic-preserving transformations defined in terms of an abstract VLIW machine
model with infinite resources (section 4.4.1). Core transformations (section 4.4.2) are the lowest level in the
PS hierarchy. Higher levels in the hierarchy guide the uniform application of core transformations to
"percolate" operations towards the top of the program (hence the name percolation scheduling) (section 4.4.2).
An implementation of the VLIW machine model of PS is presented in section 4.4.3.

4.4.1      The  Abstract  Machine  Model 

The abstract VLIW machine $M$ put forward by Nicolau has infinite resources. An instruction $I$ is of the
form:

    $L_I : [Op_I, Cj_I, Next_I]$

where $L_I$ and $Next_I$ are instruction labels, $Op_I$ is a set of RISC-like operations, and $Cj_I$ is a set of
side-effectless conditional jumps¹ constrained to form a decision tree. $M$ can execute one instruction every cycle.
Instruction execution proceeds as follows:

1.  $Op_I \neq \emptyset$: the operations in $Op_I$ are evaluated in parallel by first reading all the needed variables and
then performing all the writes. This is consistent with the way registers handle simultaneous reads and
writes. $Op_I$ is not allowed to contain two operations which write to the same variable.

2.  $Cj_I = \emptyset$: the label of the next instruction is given by $Next_I$.

3.  $Cj_I \neq \emptyset$: the label of the next instruction is determined by the path in the decision tree which is true.

An instruction $I$ can be portrayed by a binary tree $T_I$ [Ebcioglu88]. $T_I$'s root is labeled $L_I$, and each of its
leaves is marked with the label of some instruction to which $I$ could potentially branch. Each internal node
is labeled with a conditional jump. The interior nodes are organized according to the binary tree structure
imposed by $Cj_I$. If $Op_I$ is not empty, $T_I$'s root has degree one and its outgoing edge is labeled with the
operations in $Op_I$. The following instruction, for instance, can be represented by the tree in figure 12.

    L1: [ {c5:=(i<n); b:=x(i); i:=i+1},
          {IF c1 THEN
              IF (NOT c2) THEN Lj
              ELSE L3
           ELSE L7}, - ]

A program of $M$ is represented by a flow graph where each vertex contains a single $M$ instruction. Such
a representation is termed a parallel program graph. Note that there may be more than two edges out of a
parallel program graph vertex as a result of the decision tree structure of the instruction contained in it.
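
The execution rule of the abstract machine (all reads first, then all writes, then a walk down the decision tree) can be illustrated with a small interpreter for a single instruction. This is a sketch under assumptions: operations are modelled as Python functions of the old state, a decision tree is either a label or a triple `(condition, if_true, if_false)`, and conditional jumps are assumed to test the values held at instruction entry. The names `step` and `next_label` are hypothetical.

```python
def next_label(tree, state):
    """Follow the true path of a decision tree down to a target label."""
    while not isinstance(tree, str):
        cond, if_true, if_false = tree
        tree = if_true if state[cond] else if_false
    return tree


def step(ops, cj, nxt, state):
    """Execute one instruction [Op, Cj, Next]; return (new_state, label).

    ops maps each written variable to a function of the old state. Using a
    dict keyed by the written variable rules out two writes to one variable.
    """
    # Read phase: every right-hand side sees the state at instruction entry.
    results = {var: rhs(state) for var, rhs in ops.items()}
    # Assumption: jumps test the entry values of the boolean variables.
    label = nxt if cj is None else next_label(cj, state)
    # Write phase: all results are committed together.
    new_state = dict(state)
    new_state.update(results)
    return new_state, label


# An instruction shaped like the example in the text (target labels are
# illustrative): IF c1 THEN (IF NOT c2 THEN L2 ELSE L3) ELSE L7.
ops = {
    "c5": lambda s: s["i"] < s["n"],
    "b": lambda s: s["x"][s["i"]],
    "i": lambda s: s["i"] + 1,
}
cj = ("c1", ("c2", "L3", "L2"), "L7")
state = {"i": 2, "n": 10, "x": [5, 6, 7, 8], "c1": True, "c2": False}
new_state, label = step(ops, cj, None, state)
```

Because reads precede writes, `b` receives `x[2]` even though `i` is incremented by the same instruction, and an instruction containing both `a := b` and `b := a` swaps the two variables.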

¹ Conditional jumps are only allowed to test boolean variables; that is, tests of the form if x = 0 ... are illegal in the PS
formalism, and have to be converted to c := (x = 0); if c ... This is a standard approach in the VLIW world.

Figure 12: Tree representation of the previous instruction.


The main difference between the model adopted by Nicolau and the model used in the ELI is the way
conditionals are handled. In the ELI only a vine of conditionals is allowed, whereas the PS model allows for
an arbitrary decision tree (see figure 13).

4.4.2      The  Core  and  Scheduling  Transformations 

PS is composed of several program transformation layers. The goal of the high level layers is to rearrange
a program so that core transformations (the lowest PS layer) are readily applicable. Examples of high level
transformations are variable renaming, loop interchange, etc. This section will only present the bottommost
2 layers: the core and scheduling transformation layers.

In the remainder of this section we assume acyclic program graphs. PS can easily be extended to handle
cyclic code by preventing core transformations from crossing back edges. Loops can be compacted by unrolling
and applying PS. The section on software pipelining will introduce better techniques (some based on PS) to
handle loops.

Core Transformations. Core transformations operate on parallel program graphs; they are extremely
low level. There are four core transformations: Delete, Move-op, Move-cj and Unification. In what follows
the term "node" refers to a node in the input parallel program graph.

1.  Delete Transformation. The delete transformation removes unreachable or empty nodes from the
parallel program graph (a node $n$ is empty if $Op_n = Cj_n = \emptyset$). Nodes of this kind are created as a
result of other transformations. See figure 14(a).

2.  Move-op Transformation. The move-op transformation moves an operation op from a node $n_2$ to
a predecessor node $n_1$. This transformation can be performed if no true or output data dependences
exist between the operations in $n_1$ and op. Also, op cannot be moved into $n_1$ if op writes a variable live in
any of the successors of $n_1$ (unless renaming is used). Note that due to the semantics of the abstract
machine, anti-dependences do not prevent move-op from taking place. Care must be taken to ensure that
those paths which go through $n_2$, but not through $n_1$, are unaffected. To enforce this constraint, $n_2$'s
incoming edges (except for $(n_1, n_2)$) are set to point to a copy of $n_2$. See figure 14(b).

3.  Move-cj Transformation. The move-cj transformation moves a conditional jump cj from a node
$n_2$ to a predecessor node $n_1$ provided no flow data dependences exist between the operations in $n_1$ and
cj (remember that cj has no side-effects). A compensation node is created to ensure that the paths
passing through $n_2$ but not through $n_1$ are unaffected by the transformation. See figure 14(c) for a
detailed description of the transformation.

4.  Unification Transformation. The unification transformation is similar to code hoisting. Assume
the successors $n_1, \ldots, n_k$ of a node $m$ all contain the same operation op. op can be simultaneously
moved from $n_1, \ldots, n_k$ to $m$ if there is no true or output data dependence between op and any of
$m$'s operations. The unification transformation can be performed even when some of the successors
of $m$ (call them $N_1, \ldots, N_l$) do not contain op in their operation set, provided op can be inserted in
$N_1, \ldots, N_l$ without modifying the program semantics, that is, the variable written by op is not live in
the nodes reachable from $N_1, \ldots, N_l$. As usual, paths passing through any of the $n_i$ but not through
$m$ must be unaffected, and appropriate node copies have to be inserted. See figure 14(d). Note that
the unification transformation degenerates to the move-op transformation for $k = 1$.

    (a)
        IF c1 THEN L1
        ELSIF c2 THEN L2
        ELSIF c3 THEN L3
          :
        ELSIF cn THEN Ln
        ELSE Ln+1
        END IF

    (b)
        IF c1 THEN
           IF c2 THEN L1
           ELSE L2
           END IF
        ELSE
           IF c3 THEN L3
           ELSE L4
           END IF
        END IF

Figure 13: (a) Vine of tests as handled by ELI's multiway jump mechanism. (b) Decision tree that cannot
be handled by the ELI.
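
The legality conditions of the move-op transformation can be written down as a small predicate. This is a minimal sketch, not the formal definition from [Nicolau85a]: an operation is reduced to the variable it writes and the set of variables it reads, and the set of variables live in the successors of the predecessor node is assumed to be supplied by a separate liveness analysis. `Op`, `can_move_op` and `live_in_successors` are hypothetical names.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class Op(NamedTuple):
    dest: str               # variable written by the operation
    srcs: FrozenSet[str]    # variables read by the operation


def can_move_op(op: Op, n1_ops: Iterable[Op], live_in_successors: Set[str]) -> bool:
    """Return True if op may be moved up into the predecessor node n1."""
    for other in n1_ops:
        if other.dest in op.srcs:
            return False    # true dependence: op reads what n1 writes
        if other.dest == op.dest:
            return False    # output dependence: both write the same variable
        # An anti-dependence (op writes a variable that `other` reads) is
        # harmless: within one instruction all reads precede all writes.
    # op must not overwrite a value still needed on a successor path of n1
    # (renaming, which would lift this restriction, is not modelled here).
    return op.dest not in live_in_successors
```

The predicate only decides legality; the graph surgery that preserves paths entering the source node from elsewhere (the node copy of figure 14(b)) is not modelled.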

Scheduling Transformations. To apply core transformations we need some sort of high level guidance.
Scheduling transformations are responsible for compacting a program. We will introduce the two most useful
scheduling transformations: migrate and greedy-compact².

² The greedy-compact transformation is a modification by Ebcioglu [Ebcioglu88,Ebcioglu87] of the original compact-path (continued)

Figure 14: Only edges which are actively involved in transformations are labeled. (a) Delete transformation.
(b) Move-op transformation. (c) Move-cj transformation. (d) Unification transformation.

Migrate Transformation. Migrate(op, $n_2$, $n_1$) moves an operation op or a conditional jump cj as far
up in the parallel program graph as data dependences permit. Before moving an operation op from
node $n_2$ to node $n_1$, we move up all the copies of op that are in nodes reachable from $n_1$ (deepest first).
Unifications are performed whenever possible, not only to reduce the risk of code explosion, but also
to allow self-dependent operations to move up simultaneously. If copies of op can be moved to nodes
adjacent to $n_2$, unification is invoked to move op from $n_2$ to $n_1$; otherwise move-op is used. Renaming
transformations are used to enhance the applicability of core transformations.

Greedy-compact  Transformation.  Greedy-compact  is  basically  a  greedy  application  of  the  migrate 
transformation.  The  parallel  program  graph  is  explored,  level  by  level,  until  all  operations  have  been 
pushed  up  as  much  as  possible,  with  the  migrate  transformation. 

As an example, the code given in figure 5(b) is compacted with greedy-compact.

    L0:   [{c1:=(a=0.0); t1:=b*b; t2:=a*c; t7:=2.0*a; t5:=-b; Ok:=True; Num_Roots:=1}, ∅, L1]
    L1:   [{t3:=4.0*t2}, {IF c1 THEN L2 ELSE L3}, -]
    L2:   [{Ok:=False}, ∅, Exit]
    L3:   [{D:=t1-t3}, ∅, L4]
    L4:   [{c2:=(D<0.0); c3:=(D>0)}, ∅, L5]
    L5:   [∅, {IF c2 THEN L6
               ELSE
                  IF c3 THEN L7 ELSE L8}, -]
    L6:   [{Ok:=False}, ∅, Exit]
    L7:   [{Num_Roots:=2}, ∅, L8]
    L8:   [{t4:=sqrt(D)}, ∅, L9]
    L9:   [{t6:=t5-t4; t8:=t5+t4}, ∅, L10]
    L10:  [{x1:=t6/t7; x2:=t8/t7}, ∅, Exit]
    Exit:

Note that as a result of unification, operations t7:=2.0*a and t5:=-b do not have to be duplicated in L7.

Percolation scheduling will always produce code which is as good as that generated by TS, with the
advantage that less compensation code is generated. PS does not rely on any kind of branching probabilities;
programs are compacted by executing operations as early as possible.

4.4.3      ROPE 

To  take  full  advantage  of  PS,  Karplus  and  Nicolau  [Karplus85,Karplus86]  have  designed  a  machine  (ROPE) 
implementing  the  decision  tree  type  of  branching  introduced  by  the  parallel  program  graph  model. 

As the machine has remained a paper design, the exact structure of the data path is of little importance;
it is similar to that of the archetype VLIW machine presented in the introduction (figure 15). No interrupt
handling for fast context switching is provided; ROPE uses a cheap, slow processor to handle instruction
traps, page faults and I/O. The CPU might have to be frozen if the slow processor has not returned the
needed instructions or data in time.


² (continued) transformation of Nicolau. While compact-path uses branching probabilities and concentrates on compacting a single path,
greedy-compact has the advantages of compacting the whole program and needing no branching probabilities.

Figure 15: ROPE block diagram. There are $2^k$ instruction memory banks and pre-fetch units ($k$ is a machine
parameter).


What makes ROPE an original design is the way multiple conditional jumps are handled. ROPE was
designed to use slow, inexpensive memory (dynamic RAM) to support the multiway branching capability of
PS; instruction caches are not used. The program memory has to provide an instruction every cycle. Since the
cycle time of dynamic RAMs might be as much as 10 times slower than that of the CPU, instruction memory banks
are interleaved. This standard approach works well for straight-line code, but imposes intolerable delays
when a jump occurs. To avoid jump penalties, each of the $2^k$ memory banks is connected to an intelligent
pre-fetch unit. Pre-fetch units are connected in a ring (ROPE stands for Ring Of Pre-fetch Elements, see
figure 15).

Pre-fetch  units  can  be  active,  ready,  busy  or  idle.  A  pre-fetch  unit  is  active  when  the  current  instruction 
is  being  output,  ready  when  memory  fetches  have  completed  (and  the  pre-fetch  unit  is  waiting  its  turn  to 
output  the  instruction),  busy  when  memory  reads  are  in  progress,  and  is  otherwise  idle. 

For straight-line code a pre-fetch signal is passed in cyclic fashion from pre-fetch unit $i$ to pre-fetch unit
$(i + 1) \bmod 2^k$. To avoid delays this signal has to be issued well before instructions are executed. Let $d$ be
the ratio of the instruction memory cycle time to the CPU cycle time. Once the pre-fetch signal is sent to unit 0
and propagated from one pre-fetch unit to the next, one is able to obtain one instruction per cycle after a delay
of $d - 1$ cycles (provided $d \le 2^k$).

In order to process arbitrary jumps efficiently, pre-fetch units are informed ahead of time that they may
be requested to output a specific instruction. For instance, consider an instruction $I$ containing the following
multiway branch:

    IF c1 THEN
       IF c2 THEN Addr1
       ELSE Addr2
       END IF
    ELSE
       IF c3 THEN Addr3
       ELSE Addr4
       END IF
    END IF

where Addr1, ..., Addr4 are the four target addresses. $d$ cycles before instruction $I$ is scheduled, the compiler
schedules four special pre-fetches (which we will call jump-pre-fetches) to pre-fetch units (Addr1 mod $2^k$),
(Addr2 mod $2^k$), (Addr3 mod $2^k$) and (Addr4 mod $2^k$). The compiler has to ensure that these four units are
different; code duplication may be necessary. Each of the four jump-pre-fetched units receives not only an
address, but also a condition mask (jump-pre-fetched unit (Addr3 mod $2^k$), for instance, receives ((NOT c1)
AND (c3)) as a condition mask). As soon as instruction $I$ executes, the current values of the condition bits
are output on the control bus (see figure 15). The jump-pre-fetched unit whose condition mask matches the
value on the control bus will output the instruction it has fetched. If the target unit is still busy it will freeze
the CPU. Once the current instruction has been generated, control is passed to the pre-fetch unit following
the target (assuming no jumps in the current instruction). Thus a jump-pre-fetched unit has to signal to
the units to its right that they may be requested to output an instruction in the coming cycles. This is
achieved by generating a normal pre-fetch pulse (transmitted from one unit to the next) one cycle after the
jump-pre-fetch is received.
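
The condition mask sent to each jump-pre-fetched unit is simply the conjunction of the tests along the path from the root of the decision tree to that target. The following sketch derives the masks for a tree like the one above and selects the unit that matches the bits on the control bus. The tree encoding `(condition, if_true, if_false)` and the function names are assumptions made for illustration, and targets are assumed to be distinct.

```python
def condition_masks(tree, path=()):
    """Map each target of a decision tree to its condition mask.

    A tree is a target name (str) or a triple (cond, if_true, if_false).
    A mask is a tuple of (condition, required_value) pairs.
    """
    if isinstance(tree, str):
        return {tree: path}
    cond, if_true, if_false = tree
    masks = condition_masks(if_true, path + ((cond, True),))
    masks.update(condition_masks(if_false, path + ((cond, False),)))
    return masks


def selected_target(masks, bits):
    """Return the single target whose mask matches the condition bits."""
    hits = [target for target, mask in masks.items()
            if all(bits[cond] == value for cond, value in mask)]
    assert len(hits) == 1, "exactly one path of a decision tree is true"
    return hits[0]


branch = ("c1", ("c2", "Addr1", "Addr2"), ("c3", "Addr3", "Addr4"))
masks = condition_masks(branch)
```

Here `masks["Addr3"]` encodes (NOT c1) AND c3, in agreement with the example in the text. Since the masks of a decision tree are mutually exclusive, at most one unit can respond to any setting of the condition bits.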

Let us compute the number of pre-fetch units that are needed to support an $m$-way jump. On the cycle
after the condition codes have been put on the control bus, each of the $m$ jump-pre-fetched units has to be ready
(since we don't know in advance which will be the target). Furthermore, each of the $d - 1$ units to the right of
each jump-pre-fetched unit must have started pre-fetching. We therefore need $m + m(d - 1) = md$ pre-fetch
units. Karplus and Nicolau suggest that a machine with 32 or 64 pre-fetch units would be appropriate.

As the compiler should be allowed to schedule several sets of jump-pre-fetches simultaneously, each
jump-pre-fetched unit is sent a label along with the address and condition mask. This label is the same for
each of the jump-pre-fetched units involved in the same multiway branch. When the condition bits are output
on the control bus, the label of the corresponding multiway branch is also transmitted to the pre-fetch units.
The unit which has a match for both the label and condition mask is allowed to output the instruction it
has fetched. This ensures that no conflict exists between two jump-pre-fetched units which happen to have
the same condition mask, but relate to different instructions.

4.5  Region Scheduling

While TS and PS operate on program flow graphs, region scheduling (RS) [Gupta87c,Gupta88,Gupta87a]
manipulates program dependence graphs (PDG) (section 2.2). The computational model used is simpler
than that of TS or PS: instructions are allowed a single branch operation. RS works by "fairly" redistributing the
parallelism available in a program. Let $Op(R_i)$ be the set of operations directly depending on a PDG region
node $R_i$ and let $H(R_i)$ be the height of the data dependence dag of the operations in $Op(R_i)$. Define $\pi(R_i)$,
the parallelism present in a region node $R_i$, to be:

$$\pi(R_i) = \frac{\text{number of operations in } R_i}{\text{longest data dependence chain of operations in } R_i} = \frac{|Op(R_i)|}{H(R_i)}$$

$\pi(R_i)$ is the average number of operations that can be executed in parallel in any given cycle for region $R_i$.
RS works by repeatedly transforming the PDG, redistributing the parallelism available in the program
from region nodes containing high levels of parallelism to region nodes containing low levels of parallelism.
The maximum number of operations that a machine $M$ can actually perform per cycle is determined by the
architecture, and is denoted $\pi(M)$. The goal of RS is to make $\pi(R_i)$ approach $\pi(M)$ for all regions, so that
system resources are used efficiently. Structurally RS is similar to PS: a set of core transformations (called
region transformations) is repeatedly applied using higher level guidance rules.
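
The parallelism measure can be computed directly from the data dependence dag of a region. The sketch below assumes that the height of the dag is counted in operations along the longest dependence chain (that is, each operation takes one cycle) and that the region is non-empty; the representation of the dag as a predecessor map and the name `parallelism` are illustrative.

```python
def parallelism(ops, deps):
    """Return |Op(R)| / H(R) for one region.

    ops  : the operations of the region
    deps : maps an operation to the operations (of the same region) whose
           results it needs; operations without an entry depend on nothing
    """
    height = {}

    def chain_length(op):
        # Longest dependence chain ending at op, counted in operations.
        if op not in height:
            height[op] = 1 + max(
                (chain_length(p) for p in deps.get(op, ())), default=0)
        return height[op]

    ops = list(ops)
    longest = max(chain_length(op) for op in ops)
    return len(ops) / longest
```

For four operations where `c` needs `a` and `b`, and `d` needs `c`, the longest chain has three operations, so the region offers 4/3 operations per cycle on average; three mutually independent operations give a parallelism of 3.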

4.5.1      Region  Transformations 

There are four kinds of region transformations: region deletion, loop splitting/unrolling, forward/backward
code motion and region copying/collapsing. As in TS and PS, some transformations require the introduction
of compensation code to preserve semantic equivalence with the original program. A description of each
region transformation follows.

1.  Region Deletion. If a region $R_i$ has a single incoming edge $e = (R_j, R_i)$, the deletion transformation
$T_{del}(R_i)$ can be applied and region node $R_i$ can be deleted (see figure 16(a)). Redundant regions appear
as a result of other transformations.

2.  Loop Splitting/Unrolling. FOR loops containing no conditionals and having known loop bounds
can be split or unrolled. $T_{split}(L_i, R_i)$ splits the loop in region $L_i$ into two loops: one with $m$ iterations
is attached to region $R_i$, and the other with $n - m$ iterations (where $n$ is the known number of initial
iterations) remains attached to $L_i$ (see figure 16(b)). In order for $T_{split}$ to apply we must initially have:

    $\pi(R_i) < \pi(M) < \pi(L_i)$

The amount of splitting, $m$, is the first integer ($< n$) such that the above double inequality no longer
holds. Standard loop unrolling ($T_{unroll}(L_i)$) can be used to increase the parallelism within a loop body.
The amount of unrolling performed by $T_{unroll}$ is determined by the machine parallelism $\pi(M)$ and the
parallelism already available in $L_i$ (the amount of unrolling $u$ is the biggest integer such that after unrolling
the loop $u$ times $\pi(L_i) < \pi(M)$). This is different from what is done in TS, where the amount of unrolling
is arbitrary.

3.  Forward/Backward Code Motion. Forward/backward code motion transformations move code
from regions with excess parallelism to regions with insufficient parallelism. This type of transformation
can only be applied to structured nodes. A region node is structured if the code represented by the
subgraph rooted at the node is structured. $T_{move}(R_j, R_i)$ moves code forward or backward from a
region node $R_j$ to an adjacent region node $R_i$ (see figure 16(c) and (d)). It can be applied only if:

    $\pi(R_i) < \pi(M) < \pi(R_j)$

The set of operations that can be moved is determined by data dependences. Operations are moved
from $R_j$ to $R_i$ until the above inequality no longer holds.

4.  Region Copying/Collapsing. Region copying/collapsing transformations merge regions with
insufficient parallelism. The region copy transformation, $T_{copy}(R_i)$, copies code from a region $R_i$ into
each of $R_i$'s predecessors (figure 16(e)). This transformation eliminates the need for an unconditional
jump instruction. The region collapsing transformation, $T_{merge}(R_i, Test, R_j)$, merges $R_i$, $Test$ and $R_j$ to
form a new operation node $op_k$ (figure 16(f)). After fusion the operations in regions $R_i$, $R_j$ and $Test$
are treated as a single unit in further transformations. The merge transformation is applied only if
$\pi(R_i) + \pi(R_j) < \pi(M)$, and if the architecture allows discarding of values (see section 5.2.1), which
would allow the operations in $R_i$ and $R_j$ to execute concurrently. The outcome of $Test$ would decide
which one of the two sets of operations would be aborted.


4.5.2      Guidance  Rules  for  Applying  Region  Transformations 

The above transformations can be applied repeatedly to increase the parallelism present in a region up to
the point where machine resources are exhausted. The repeated application is necessary because a
transformation may enable the application of another transformation. The RS algorithm has two phases: a local
and a global transformation phase.

Local transformation phase. The local transformation phase considers adjacent region nodes, and needs
only a local view of the PDG. It proceeds as follows:

1.  Let $S = \{R_i \mid \pi(R_i) < \pi(M) - \epsilon\}$, where $\epsilon$ is some architecture dependent threshold. $S$ is the set of
region nodes with insufficient parallelism.

2.  For all $R_i \in S$, apply one of the following transformations in the order given below.

    (a) $T_{split}(L_i, R_i)$ if $R_i$ has an adjacent loop region $L_i$ with too much parallelism;

    (b) $T_{unroll}(R_i)$ if $R_i$ is a loop region node;

    (c) $T_{move}(R_j, R_i)$ if $R_i$ has an adjacent region $R_j$ with too much parallelism;

    (d) $T_{copy}(R_i)$;

    (e) $T_{merge}(R_i, Test, R_j)$ if $\pi(R_i) + \pi(R_j) < \pi(M)$.

If the transformation applied hasn't transferred enough parallelism, $R_i$ remains in $S$, and is subsequently
reconsidered for other transformations.

3.  After performing one of the above transformations, $T_{del}$ might have to be invoked to clean up the PDG.

4.  Repeat step 2 until $S = \emptyset$ or no transformations apply.

Figure 16: (a) Region deletion. (b) Loop splitting. (c) Forward code motion. (d) Backward code motion.
(e) Region copying. (f) Region collapsing.

Global  transformation  phase.  This  phase  is  invoked  once  local  transformations  have  been  applied.  Global 
transformations consider the unbalanced paths of the PDG. A path $P$ is said to be unbalanced if its endpoints
are two region nodes, one with insufficient parallelism and the other with excess parallelism. The global
transformation  phase  proceeds  as  follows: 

1. Let $S = \{R_i \mid \pi(R_i) < \pi(M) - \epsilon\}$.

2. For all $R_i \in S$ find the shortest path $P = R_i, R_{i+1}, \ldots, R_k$ such that $\pi(R_k) > \pi(M)$ or $\pi(R_k)$ can be
made so by applying $T_{split}$, $T_{unroll}$, $T_{copy}$ or $T_{merge}$ locally to $R_k$. Apply the series of transformations:

$T_{move}(R_k, R_{k-1}),\ T_{move}(R_{k-1}, R_{k-2}),\ \ldots,\ T_{move}(R_{i+1}, R_i)$

$R_i$ remains in $S$ as long as $\pi(R_i) < \pi(M) - \epsilon$.

3. Clean up the PDG by means of $T_{del}$.

4. Repeat step 2 until $S = \emptyset$ or no transformations apply.

After  global  transformations  have  been  applied  code  size  might  increase  exponentially,  but  the  code  explosion 
danger  is  not  as  acute  as  in  TS. 

4.5.3      From  Region  Scheduling  to  Code  Generation 

After  the  PDG  has  been  transformed  and  the  parallelism  redistributed  with  RS,  the  code  generator  can 
schedule  the  individual  regions  and  generate  code  for  the  target  architecture.  Code  generation  is  carried  out 
one  region  at  a  time.  The  order  in  which  regions  are  processed  is  the  reverse  topological  order  imposed  by 
the CDG (back edges are ignored). Code for a region node $R_i$ can be generated by using list scheduling on
the operations that are directly control dependent on $R_i$.

Gupta has used RS to compile programs for a reconfigurable LIW (RLIW) architecture [Gupta87b,Gupta87a].
The peculiarity of the RLIW architecture developed by Gupta resides in the ability to reconfigure connections
between ALUs in each cycle. ALUs are arranged in an $n \times m$ grid. The ALUs in row $i$ ($0 \le i < n - 1$) are
connected to the ALUs in row $i + 1$ via a unidirectional crossbar. Each column $j$ ($0 \le j < m$) of ALUs is
connected  to  memory  module  j  (there  are  as  many  memory  modules  as  there  are  columns  of  ALUs).  Data 
flows  uniformly  from  the  first  ALU  row  to  the  last.  Connections  between  successive  rows  can  be  changed 
every  cycle.  Memory  modules  are  all  connected  to  an  "intermemory  module  transfer  request  handler"  whose 
role  is  to  transfer  data  from  one  memory  module  to  another  when  memory  disambiguation  has  failed  at 
compile time.

5      Software  Pipelining

All compilation techniques seen so far exploit loop level parallelism by using loop unrolling in conjunction
with global compaction. While this simple approach is sometimes adequate, it is in general not the best
way to speed up loops. Software pipelining proposes to transform loop bodies so that loop iterations can
be started every cycle. The name software pipelining comes from the similarity with hardware pipelining,
where operations are initiated on a cycle basis. Figure 17 shows a three-stage hardware pipeline and its
time diagram. If we consider a three-instruction loop such as the one in figure 18(a) as a three-stage
software pipeline (figure 18(b)), that is, we initiate a new iteration every cycle, we obtain the time diagram of
figure 18(c). Note the similarity with hardware pipelining. From the time diagram we see that the loop can be
restructured so as to achieve the effect of a software pipeline. Figure 18(d) gives the software pipelined loop.
The first two and last two instructions of the code are respectively called loop prelude (or prolog) and loop
postlude (or epilog). Machine pipelines get filled during loop preludes, are in their steady state while the
pipelined loop body is executing, and drain during loop postludes. Note that unrolling and compacting a
loop does not have a similar effect, as machine pipelines usually drain every unroll iterations, where unroll is
the amount of loop unrolling. Furthermore, software pipelining usually produces quite compact loop bodies,
while loop unrolling does not. This can be a serious problem if the instruction cache is small.

Figure 17: (a) A 3-stage hardware pipeline, (b) Time diagram of the hardware pipeline. Time elapses
vertically, stages horizontally. Boxes are labeled Sx(oy), meaning that in a given time slot, stage Sx is
executing operation oy.
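The time diagram described above can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes a one-cycle initiation interval, no loop carried dependences, and one instruction per stage; the function name `pipeline_diagram` and the stage labels are hypothetical and not taken from the figures.

```python
def pipeline_diagram(stages, iterations):
    """Return one row per cycle; each row lists the stage(iteration) pairs issued in that cycle."""
    rows = []
    for t in range(iterations + len(stages) - 1):
        row = []
        for s, name in enumerate(stages):
            it = t - s  # iteration whose stage s executes in cycle t
            if 0 <= it < iterations:
                row.append(f"{name}({it + 1})")
        rows.append(row)
    return rows

# Three-instruction loop, five iterations, a new iteration started every cycle.
rows = pipeline_diagram(["I1", "I2", "I3"], 5)
for t, row in enumerate(rows):
    print(t, " ".join(row))
```

In this sketch the first two rows play the role of the prelude, the rows containing all three stages form the steady state, and the last two rows correspond to the postlude.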

It is not always possible to start an iteration every cycle, as iteration $i + 1$ might depend on iteration $i$.
In the general case loop carried dependences (and resource availability) dictate the delay (called initiation
interval [Rau81]) that one needs to respect before starting the next iteration. The goal of a compiler is to
minimize such delay. Software pipelining is more general than vectorization since all vectorizable loops can be
software pipelined while the converse does not hold. This is one important advantage of VLIW architectures
over vector machines.

Figure 18: (a) A 3-instruction loop. Assume no loop carried dependences. (b) The loop viewed as a 3-stage
software pipeline. Instruction I1 is stage 1, I2 stage 2, I3 stage 3. (c) Time diagram of the software pipeline.
Ix(y) means that at a given time the processor is executing instruction Ix of the yth iteration. (d)
Software pipelined loop.


The loop pipelining idea is not new: Kogge [Kogge77] used it for hand-coding microprogrammed pipelined
machines, while recently Cytron has explored its application to multiprocessors [Cytron86]. Software pipelining
algorithms can be divided into two classes:

1.  straight  line  software  pipelining  techniques  (section  5.1)  pipeline  loops  with  branch  free  bodies;  they 
have  mainly  been  developed  for  horizontal  machines  with  limited  parallelism; 

2. complex software pipelining techniques (section 5.2), developed for highly parallel VLIWs, can pipeline
loops  containing  arbitrary  conditional  branches. 

In what follows loops are identified with their bodies and procedure calls are assumed to be inlined.

5.1      Straight  Line  Software  Pipelining  Techniques

The  previous  section  has  introduced  software  pipelining  as  being  the  software  analog  of  hardware  pipelining. 
While  this  analogy  motivates  the  name,  straight  line  software  pipelining  techniques  view  loop  pipelining 
from a different (although equivalent) angle, based on the initiation interval concept. Consider a loop $\mathcal{L}$
comprising $lc$ instructions (unless otherwise stated, all loops considered in this and the following subsections
have straight line bodies). When $\mathcal{L}$ is executed serially an iteration is started every $lc$ cycles (assuming an
instruction per cycle execution speed). To speed up $\mathcal{L}$, successive iterations can be initiated every $t_{ii} < lc$
cycles ($t_{ii}$ is the initiation interval). If we draw a time/iteration execution diagram of the loop in the two
cases $t_{ii} = lc$ and $t_{ii} < lc$ we obtain figures 19(a) and 19(b). If $\mathcal{L}$ has a sufficient number of iterations
a repeating pattern can be detected. For instance, for a three-instruction loop such as the one of
figure 18(a), the repeating pattern is shown in figure 20(a) and the software pipelined loop in 20(b).

Figure 19: (a) Time/iteration execution diagram of $\mathcal{L}$ when $t_{ii} = lc$. (b) Time/iteration execution diagram
of $\mathcal{L}$ when $t_{ii} < lc$.

A simple straight line software pipelining algorithm which generalizes the approach illustrated in figure 20
has been developed by Charlesworth for the FPS-164 [Charlesworth81]. The FPS-164 is a horizontal machine
with modest parallelism, having one adder and one multiplier. Charlesworth's algorithm is restricted to loops
with known bounds and no carried dependences. The algorithm proceeds as follows:

1. $\mathcal{L}$ is compacted using some local compaction method like list scheduling. Let $\mathcal{L}_{lc}$ be the loop obtained
after such compaction.


As we will see later on, the value of $t_{ii}$ is tightly related to machine resource constraints and $\mathcal{L}$'s loop carried dependences.

2. Determine $n_{cr}$, the number of times the most frequently used resource (critical resource) is used in the
loop. For the FPS $n_{cr}$ < in (by the pigeonhole principle). Initially set $t_{ii}$ to $n_{cr}$.

3. Let $\mathcal{L}_{sp}$ be the software pipelined loop. $\mathcal{L}_{sp}$ is generated from $\mathcal{L}_{lc}$ by packing every $t_{ii}$th instruction
of $\mathcal{L}_{lc}$ into one $\mathcal{L}_{sp}$ instruction (array indices have to be adjusted). If resource or (intra-loop) data
dependence conflicts arise, add one to $t_{ii}$, and repeat this step until the entire $\mathcal{L}_{lc}$ is folded into $\mathcal{L}_{sp}$.

4. Add to $\mathcal{L}_{sp}$ the loop exit test and the code to increment array indices, and generate the prelude and postlude
to fill and empty the software pipeline.

Note that the body of $\mathcal{L}_{sp}$ has $t_{ii}$ instructions. The value $\lceil lc / t_{ii} \rceil$, called degree of pipelining (and denoted
$D_p$), is the number of iterations executed in parallel in the steady state and the number of times we need to
iterate $\mathcal{L}_{sp}$ to execute one full $\mathcal{L}$ iteration. $D_p$ is a close approximation of the speedup achieved by software
pipelining. The previous algorithm assumes a constant initiation interval and a fixed schedule for each $\mathcal{L}$
iteration, but this need not be the case. A variable initiation interval combined with variable iteration schedules
might in fact yield a better throughput. Rau and Glaeser [Rau81] suggest using a constant $t_{ii}$ and the same
schedule for every iteration to reduce the size of the pipelined loop and the complexity of software pipelining
itself.

Figure 20: (a) Time/iteration execution diagram of the loop in figure 18. Note the repeating pattern
I3(i); I2(i+1); I1(i+2). (b) The software pipelined loop.
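The folding of steps 2 and 3 can be sketched as follows. This is a simplified illustration, not Charlesworth's implementation: it checks only resource conflicts (intra-loop data dependence conflicts and array index adjustment are ignored), the compacted loop is represented as a list of per-instruction resource counts, and the names `fold` and `charlesworth_interval` as well as the sample schedule are hypothetical. The degree of pipelining is taken here as the compacted loop length divided by $t_{ii}$, rounded up.

```python
from math import ceil

def fold(compacted, capacity, t_ii):
    """Pack every t_ii-th instruction together; return the folded body, or None on a resource conflict."""
    body = [dict() for _ in range(t_ii)]
    for cycle, instr in enumerate(compacted):
        slot = body[cycle % t_ii]
        for res, n in instr.items():
            slot[res] = slot.get(res, 0) + n
            if slot[res] > capacity[res]:
                return None
    return body

def charlesworth_interval(compacted, capacity):
    """Return (t_ii, degree of pipelining) found by repeatedly trying to fold the loop."""
    totals = {}
    for instr in compacted:
        for res, n in instr.items():
            totals[res] = totals.get(res, 0) + n
    # Start from the critical-resource count (n_cr when every unit is single-instance).
    t_ii = max([ceil(totals[r] / capacity[r]) for r in totals], default=1)
    while fold(compacted, capacity, t_ii) is None:
        t_ii += 1  # conflict: relax the initiation interval and retry
    return t_ii, ceil(len(compacted) / t_ii)

# Hypothetical compacted loop on a machine with one adder and one multiplier.
units = {"add": 1, "mul": 1}
print(charlesworth_interval([{"mul": 1}, {"mul": 1}, {}, {"add": 1}], units))
```

The search always terminates, since a valid compacted loop folds onto itself without conflict once $t_{ii}$ reaches its length.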

Let us illustrate Charlesworth's method on the loop of figure 21(a); its intermediate representation is given
in 21(b). The target machine is the FPS, and we assume an execution rate of one instruction per cycle. $\mathcal{L}$ is
compacted with bottom-up list scheduling; the result (written in the PS formalism) is given in 21(c). The
most critical resource of the loop is the multiplier, hence $n_{cr} = 2$. Initially $t_{ii} = 2$. We create $\mathcal{L}_{sp}$ from $\mathcal{L}_{lc}$
by packing every other instruction of $\mathcal{L}_{lc}$ into an $\mathcal{L}_{sp}$ instruction (array indices have to be adjusted); the
result is given in 21(d). Of course, in the final code a(i+2), b(i+2), y(i+2), z(i+2) have to be translated into
i' := i+2 and a(i'), b(i'), y(i'), z(i'). After adding to $\mathcal{L}_{sp}$ the loop exit test, code to increment array indices,
the prolog and epilog, we finally obtain the software pipelined loop of figure 21(e).

Charlesworth's straight line software pipelining method is very simple. An early improvement over
Charlesworth's algorithm is the unroll pipeline reroll method (URPR) of Su [Su84,Su86,Mueller86], which
can pipeline loops with carried dependences. Like Charlesworth's technique, URPR is limited to loops with
known bounds. The idea behind URPR is as follows: if operation $op_2$ of iteration $i$ (denoted $op_2(i)$) is data
dependent on operation $op_1$ of iteration $i - 1$ (denoted $op_1(i - 1)$), then a delay equal to the execution time
of $op_1$ must be respected between the execution of $op_1(i - 1)$ and the execution of $op_2(i)$. Such delay restricts
the value of $t_{ii}$. The algorithm assumes an execution rate of one instruction per cycle and proceeds as follows:

1. $\mathcal{L}$ is compacted using list scheduling. Let $\mathcal{L}_{lc}$ be the loop obtained after such compaction.

2. Compute $n_{cr}$, the number of times the critical resource is used in $\mathcal{L}_{lc}$.

3. Let $instruction\_num(op)$ be the number of the instruction to which an operation $op$ of $\mathcal{L}_{lc}$ belongs.
Consider the data dependence graph of $\mathcal{L}_{lc}$. To respect loop carried dependences from one $\mathcal{L}_{lc}$ iteration
to the next a delay $D$ of:

$$D = 1 + \max \{\, instruction\_num(op_2) - instruction\_num(op_1) \mid op_1, op_2 \in \mathcal{L}_{lc} \text{ and } \lambda(op_1, op_2) = 1 \,\}$$

must be respected between the execution of successive $\mathcal{L}_{lc}$ iterations ($\lambda$ was defined in section 2.1).
Assuming a single instance for the critical resource, $t_{ii} = \max(n_{cr}, D)$.

4. Unroll Phase. Let $lc_{lc}$ be the number of instructions in $\mathcal{L}_{lc}$. Compute $D_p = \lceil lc_{lc} / t_{ii} \rceil$, the degree of
pipelining. Unroll $\mathcal{L}_{lc}$ to create $D_p$ loop bodies: $\mathcal{L}_{lc}(i), \mathcal{L}_{lc}(i + 1), \ldots, \mathcal{L}_{lc}(i + D_p - 1)$.

5. Pipeline Phase. The $D_p$ iterations are pipelined by placing them into a grid where rows represent
machine cycles, and the columns contain the successive $\mathcal{L}_{lc}$ iterations. Iteration $\mathcal{L}_{lc}(i + k)$ is placed
in column $k$ starting at cycle $k \cdot t_{ii}$. The schedule of $\mathcal{L}_{lc}(i + k)$ may be stretched to satisfy loop
carried dependences with the previously scheduled $k - 1$ iterations (note that step 3 only takes into
account loop carried dependences from one iteration to the next). If resource conflicts occur between
an instruction $I$ of $\mathcal{L}_{lc}$ scheduled in cycle $c$, and some instruction of a previously scheduled iteration, $I$
is placed in cycle $c + 1$. If a resource conflict still occurs, a new empty row is created between rows $c$ and
$c + 1$ where $I$ is inserted. The above process is repeated until all the $\mathcal{L}_{lc}(i + k)$s have been scheduled.

6. Reroll Phase. The $D_p$ iterations are rerolled into a new loop body which includes all the instructions
of $\mathcal{L}_{lc}$, and is of minimal length. Such loop body is determined by exhaustive search. After adding the
exit test, and generating the loop prelude and postlude we finally obtain the software pipelined loop.
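The arithmetic behind steps 3 to 5 (choice of $t_{ii}$, degree of pipelining, and starting cycle of each unrolled iteration) can be sketched as below. This is a minimal illustration only: the delay $D$ is taken as an input rather than derived from a dependence graph, stretching and resource conflict handling of the pipeline phase are omitted, and the name `urpr_parameters` is hypothetical.

```python
from math import ceil

def urpr_parameters(n_instructions, n_cr, carried_delay=None):
    """Return (t_ii, D_p, start cycles) for a compacted loop of n_instructions instructions."""
    # carried_delay is the delay D of step 3; None means no loop carried dependences.
    t_ii = n_cr if carried_delay is None else max(n_cr, carried_delay)
    d_p = ceil(n_instructions / t_ii)
    starts = [k * t_ii for k in range(d_p)]  # column k of the grid starts at cycle k * t_ii
    return t_ii, d_p, starts

# A seven-instruction compacted loop whose critical resource is used three times.
print(urpr_parameters(7, 3))
```

For a seven-instruction compacted loop with $n_{cr} = 3$ and no carried dependences this gives $t_{ii} = 3$ and $D_p = 3$, with the three unrolled iterations starting at cycles 0, 3 and 6.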

As an example consider a loop with no carried dependences whose compacted loop $\mathcal{L}_{lc}$ has the resource
requirements shown in figure 22(a) ($+$, $\times$, $\div$, $\surd$ respectively mean floating point addition, multiplication,
division and square root). Since there are no loop carried dependences and $n_{cr} = 3$, $t_{ii} = 3$. We need to
pipeline $D_p = \lceil 7/3 \rceil = 3$ consecutive iterations. The result of the pipelining phase is given in 22(b). After
rerolling we obtain the loop body given in 22(c).

FOR i IN 1..100 LOOP
x(i) := a(i)*y(i)+b(i)*z(i);
END LOOP;

Figure 21: (a) The loop to be pipelined, (b) The intermediate code representation of the loop's body, (c)
The loop after local compaction (bottom-up list scheduling), (d) The body of the software pipelined loop,
(e) The final software pipelined loop.

Although URPR handles loop carried dependences, it does so in a very peculiar fashion, as the value of $t_{ii}$
only reflects those dependences which are carried from one iteration to the next. Also, resource constraints
are satisfied by stretching fixed iteration schedules which are imposed by the local compaction phase of
URPR. This appears to be somewhat awkward. Generalized URPR (GURPR) [Su87] extends URPR to
handle practically any kind of loop. A list of techniques is given to patch URPR in the event of conditional
branches, procedure calls, etc. It remains, however, that general loop carried dependences are not adequately
handled, as GURPR is built on top of URPR.

Figure 22: (a) Resource requirements of some loop after compaction with list scheduling, (b) The loop after
the pipelining phase, (c) The rerolled loop body.

The pipelining algorithms of Charlesworth and Su illustrate the initial attempts to solve the straight
line software pipelining problem (formally defined in section 5.1.1). Another early attempt was made by
Touzeau [Touzeau84]. His method adequately pipelines loops with arbitrary carried dependences, but can
only process single statement loops. The algorithm is similar to the more general technique of Lam presented
in section 5.1.2: given an initiation interval, Touzeau's method tries to rearrange the operations of the loop
processed  such  that  resource  and  data  dependence  constraints  are  satisfied.  The  smallest  initiation  interval 
for  which  a  valid  loop  schedule  can  be  unveiled  is  determined  by  binary  search. 

After formally defining the straight line software pipelining problem in section 5.1.1 we present two recent
straight line software pipelining algorithms developed by Lam [Lam88] (section 5.1.2) and Aiken and Nicolau
[Aiken88b] (section 5.1.4), which give optimal or quasi-optimal results.

5.1.1      Formal  Definition  of  the  Straight  Line  Software  Pipelining  Problem

Let $t_{ii}(i, i+1)$ be the initiation interval between iterations $i$ and $i + 1$, and let $t_{ii}(i, i+k)$ be the initiation
interval between iterations $i$ and $i + k$:

$$t_{ii}(i, i+k) = \sum_{h=0}^{k-1} t_{ii}(i+h, i+h+1)$$

The  most  general  definition  of  the  straight  line  software  pipelining  problem  (SLSP  problem)  is:  given 

1.  a  machine  M  with  a  set  of  resource  constraints, 

2. a loop $\mathcal{L}$ comprising $n$ iterations,

3. $\mathcal{L}$'s data dependence graph,

the goal is to determine schedules $\sigma_1, \sigma_2, \ldots, \sigma_i, \ldots, \sigma_n$ and initiation intervals $t_{ii}(1,2), \ldots, t_{ii}(i-1,i), \ldots,
t_{ii}(n-1,n)$ respectively for iterations $1, 2, \ldots, i, \ldots, n$ such that $M$'s resource constraints and $\mathcal{L}$'s data
dependences are satisfied, and $\mathcal{L}$ executes in the shortest possible time, i.e.

$$\max_{1 \le i \le n} \left( t_{ii}(1,i) + len(\sigma_i) \right) \text{ is minimum}$$

($len(\sigma_i)$ is the length of schedule $\sigma_i$; see section 3.1 for notations).

According to the above definition every iteration could potentially have a different schedule. This would
entail very poor space efficiency, since the code for each iteration would have to be fully specified. This
has led researchers like Rau [Rau81] and Charlesworth [Charlesworth81] (see previous section) to define a
restricted version of the general SLSP problem which imposes $\sigma_1 = \sigma_2 = \cdots = \sigma_i = \cdots = \sigma_n = \sigma$ and
$t_{ii}(1,2) = \cdots = t_{ii}(i-1,i) = \cdots = t_{ii}(n-1,n) = t_{ii}$. The goal of the restricted SLSP problem is to find the
shortest loop schedule $\sigma$ and the minimum constant initiation interval $t_{ii}$, such that $M$'s resource constraints
and $\mathcal{L}$'s data dependences are respected when a new iteration is started every $t_{ii}$ cycles. Given a loop schedule
$\sigma$ and a constant initiation interval $t_{ii}$ that satisfy resource and data dependence requirements, extremely
compact pipelined loop code can be generated (as in Charlesworth's algorithm) by packing together every
$t_{ii}$th instruction of $\sigma$.

Unfortunately, while optimal solutions to the general SLSP problem guarantee time optimal results,
optimal solutions to the restricted SLSP problem do not. To see this consider the loop of figure 23(a), whose
data dependence graph is given in 23(b). If we assume a machine with infinite resources and a 1 cycle delay
for all operations, the optimal solution to the restricted SLSP problem is given in 23(c). The solution is
not time optimal: it takes $2 \cdot n$ cycles to execute the whole loop. The optimal solution to the general SLSP
problem is given in 23(d); it requires $\lceil \frac{3}{2} n \rceil$ time steps.
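The gap between the two formulations can be checked numerically on the loop of figure 23(a). The sketch below is an illustration under stated assumptions: the dependence edges are read off the loop text (only flow dependences, unit delays, unlimited resources), the general solution is obtained by scheduling every operation as soon as possible, and the restricted solution by searching for the smallest constant $t_{ii}$ that admits a fixed schedule. All function names are hypothetical.

```python
# Edges (source, sink, lambda, delta) assumed from the loop text: A feeds B in the
# same iteration, B feeds C and C feeds A of the next iteration; all delays are 1.
EDGES = [("A", "B", 0, 1), ("B", "C", 1, 1), ("C", "A", 1, 1)]
OPS = ["A", "B", "C"]  # listed so that lambda = 0 edges go from earlier to later entries

def general_makespan(n):
    """Cycles needed when each operation of each iteration starts as early as possible."""
    start = {}
    for i in range(1, n + 1):
        for op in OPS:
            t = 0
            for src, dst, lam, delta in EDGES:
                if dst == op and i - lam >= 1:
                    t = max(t, start[(src, i - lam)] + delta)
            start[(op, i)] = t
    return max(start.values()) + 1

def fixed_schedule(t_ii):
    """Earliest sigma with sigma(dst) - sigma(src) >= delta - lambda * t_ii, or None if infeasible."""
    sigma = {op: 0 for op in OPS}
    for _ in range(len(OPS) + 1):
        changed = False
        for src, dst, lam, delta in EDGES:
            need = sigma[src] + delta - lam * t_ii
            if sigma[dst] < need:
                sigma[dst] = need
                changed = True
        if not changed:
            return sigma
    return None  # still changing: a positive-weight cycle exists, t_ii is too small

def restricted_makespan(n):
    """Cycles needed with one fixed schedule and the smallest feasible constant t_ii."""
    t_ii = 1
    while fixed_schedule(t_ii) is None:
        t_ii += 1
    length = max(fixed_schedule(t_ii).values()) + 1
    return t_ii * (n - 1) + length

for n in (4, 5, 10):
    print(n, general_makespan(n), restricted_makespan(n))
```

Under these assumptions the restricted search settles on $t_{ii} = 2$ with a two-cycle schedule, giving $2n$ cycles, while the as-soon-as-possible schedule needs about three cycles for every two iterations.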

We  now  formally  define  the  general  and  restricted  SLSP  problems.  The  input  to  the  two  problems  is 
similar  to  the  local  compaction  problem  (LCP)  input,  except  for  the  fact  that  LCP  deals  with  acyclic  instead 
of  cyclic  code  (refer  to  section  3.1  for  notations). 


The initiation interval $t_{ii}(i, i+1)$ is the delay that separates the beginning of execution of iterations $i$ and $i + 1$.
For all iterations $i$ ($1 \le i \le n$) and for all operations $op$ of $\mathcal{L}$, $\sigma_i(op)$ gives the relative schedule time of $op$ within iteration $i$.
The absolute schedule time of $op$ during iteration $i$ is $t_{ii}(1, i) + \sigma_i(op)$.

FOR i IN 1..n LOOP

a(i) := c(i-1); --A
b(i+1) := a(i); --B
c(i) := b(i); --C

END LOOP;

Figure 23: (a) A loop, (b) Its data dependence graph. Dashed edges indicate carried dependences from one
iteration to the next ($\lambda(e) = 1$). (c) Optimal solution to the restricted SLSP problem. $t_{ii}$ and $\sigma$ are given
to the right, the time/iteration execution diagram is given to the left. (d) Optimal solution to the general
SLSP problem. The $t_{ii}$s and $\sigma_i$s are given to the right, the time/iteration execution diagram is given to the
left.

Given

1. a machine $M$, a set of resources $R = \{r_1, \ldots, r_m\}$, and a resource configuration vector $R_M$,

2. a loop $\mathcal{L}$ comprising $n$ iterations and composed of operations $O = \{op_1, \ldots, op_j, \ldots, op_l\}$,

3. a duration function $d : O \to \mathbb{N}$ and a resource usage function $R_O : O \times \mathbb{Z} \to \mathbb{N}^m$,

4. a data dependence graph $DDG = (O, E)$ imposing a partial order on $O$,

5. a tuple of functions $[\lambda(e), \delta(e)]$ associated with each edge $e$ of the DDG (remember from section 2.1
that each edge $e = (op_{j_1}, op_{j_2}) \in E$ has an associated tuple $[\lambda(e), \delta(e)]$, whose meaning is: operation $op_{j_2}$
must execute $\delta(e)$ cycles after operation $op_{j_1}$ of the $\lambda(e)$th previous iteration has started executing);

the  goal  is  to  find 

General problem: schedules $\sigma_1, \sigma_2, \ldots, \sigma_i, \ldots, \sigma_n : O \to \mathbb{N}$ and initiation intervals
$t_{ii}(1,2), \ldots, t_{ii}(i-1,i), \ldots, t_{ii}(n-1,n)$ respectively for iterations $1, 2, \ldots, i, \ldots, n$

Restricted problem: a fixed loop schedule $\sigma : O \to \mathbb{N}$ and a constant initiation interval $t_{ii}$

subject to the following conditions:

1. minimality:

General problem: $\max_{1 \le i \le n} \left( t_{ii}(1,i) + len(\sigma_i) \right)$ is minimum

Restricted problem: $t_{ii}$ and $len(\sigma)$ are minimum

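2. dependence constraint:

General problem: given a data dependence edge $e = (op_{j_1}, op_{j_2}) \in E$ there are two cases
to consider:

(a) if $\lambda(e) = 0$ then we must have: $\forall\, 1 \le i \le n \quad \sigma_i(op_{j_2}) - \sigma_i(op_{j_1}) \ge \delta(e)$

(b) if $\lambda(e) \ne 0$ then we must have:

$$\forall\, 1 \le i \le n - \lambda(e) \quad t_{ii}(i, i+\lambda(e)) + \sigma_{i+\lambda(e)}(op_{j_2}) - \sigma_i(op_{j_1}) \ge \delta(e)$$

By merging the two conditions we obtain:

$$\forall e = (op_{j_1}, op_{j_2}) \in E \quad \forall\, 1 \le i \le n - \lambda(e) \quad \sigma_{i+\lambda(e)}(op_{j_2}) - \sigma_i(op_{j_1}) \ge \delta(e) - t_{ii}(i, i+\lambda(e))$$

Restricted problem: in the restricted case we have $\sigma_{i+\lambda(e)} = \sigma_i = \sigma$ and $t_{ii}(i, i+\lambda(e)) =
\lambda(e) \cdot t_{ii}$, which gives the following data dependence constraint:

$$\forall e = (op_{j_1}, op_{j_2}) \in E \quad \sigma(op_{j_2}) - \sigma(op_{j_1}) \ge \delta(e) - \lambda(e) \cdot t_{ii}$$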

3. resource constraint:

General problem: extending the resource constraint of section 3.1 to the iterative case gives:

$$\forall\, 0 \le t < \max_{1 \le i \le n} \left( t_{ii}(1,i) + len(\sigma_i) \right) \quad \sum_{j=1}^{l} \sum_{i=1}^{n} R_O\!\left(op_j,\ t - (t_{ii}(1,i) + \sigma_i(op_j))\right) \le R_M$$

Restricted problem: the resource constraint for the restricted problem can be derived from
the previous formula. However, a more intuitive understanding is obtained by deriving
the formula directly. In the restricted case iterations all have the same schedule and
are initiated at the rate of one iteration every $t_{ii}$ cycles; the total resource requirements
of every $t_{ii}$th instruction in $\mathcal{L}$ should therefore not exceed the available limit (note that
operations lasting more than $t_{ii}$ cycles must be handled properly). All these considerations
yield the following constraint:

$$\forall\, 0 \le t < t_{ii} \quad \sum_{j=1}^{l} \sum_{x \ge 0} R_O\!\left(op_j,\ t - (\sigma(op_j) \bmod t_{ii}) + x \cdot t_{ii}\right) \le R_M$$

Like the local compaction problem, the restricted and general SLSP problems are NP-complete [Lam87,Garey79].

Lam's algorithm (section 5.1.2) is based on the restricted SLSP problem. As we already mentioned,
this choice is motivated by the concern of minimizing object code size, and by the fact that in the infinite
resource case (no resource constraints) optimal results for the restricted SLSP problem are guaranteed never
to be more than $n$ cycles slower than the time optimum (see section 5.1.2). Aiken and Nicolau's algorithm
(section 5.1.4) gives an optimal solution to the general SLSP problem in the infinite resource case.

5.1.2      A  General  Algorithm  for  the  Restricted  SLSP  Problem 

The pipelining techniques of Charlesworth and Su are too rigid: the initially fixed loop schedule imposes
an unnecessary constraint on the initiation interval $t_{ii}$. One of the reasons for this rigidity comes from the
circular dependency between $t_{ii}$ and $\sigma$ (scheduling constraints are defined in terms of the initiation interval
$t_{ii}$; however, determining whether $t_{ii}$ satisfies resource and data dependence constraints depends on the
schedule $\sigma$ chosen).

To  resolve  this  circularity  Lam  [Lam87,Lam88]  adopts  an  iterative  approach  (see  below).  Her  method 
can  handle  any  type  of  straight  line  loop  body.  The  algorithm  proceeds  as  follows: 

1. Based on resource and data dependence constraints, determine the lower and upper bound of $t_{ii}$. Let
$t_{min}$ and $t_{max}$ be such lower and upper bounds. Initially set $t_{ii}$ to $t_{min}$.

2.  Partition  the  loop  DDG  into  strongly  connected  components. 

3.  Find  a  separate  schedule  for  each  strong  component  using  an  adapted  list  scheduling  algorithm. 

4.  Condense  each  strong  component  into  a  single  node. 

5.  Schedule  the  condensed  DDG  (which  is  now  acyclic)  using  list  scheduling. 

6. If any of steps 3 or 5 fails (i.e. we cannot find a schedule given the current $t_{ii}$), increment $t_{ii}$ by one
and repeat steps 3 through 5. If $t_{ii}$ is set to $t_{max}$ the loop cannot be pipelined.
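The outer iteration of steps 1 and 6 can be sketched as a simple linear search. This is only a skeleton: `try_schedule` is a hypothetical callback standing for steps 2 through 5 (it returns a schedule, or `None` when no schedule exists for the given interval), and the function name is an assumption of this sketch.

```python
def find_initiation_interval(t_min, t_max, try_schedule):
    """Try t_min, t_min + 1, ... and return (t_ii, schedule) for the first interval that works."""
    for t_ii in range(t_min, t_max):
        schedule = try_schedule(t_ii)
        if schedule is not None:
            return t_ii, schedule
    return None  # t_ii reached t_max: the loop cannot be pipelined

# Stand-in scheduler that succeeds only from t_ii = 4 onwards.
print(find_initiation_interval(2, 6, lambda t: {"t_ii": t} if t >= 4 else None))
```

The search is linear rather than binary because, with a heuristic scheduler, failing at one interval does not imply failing at every smaller one; this is a design observation about the sketch, not a claim about the original implementation.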

Let  £  be  the  loop  to  be  pipelined.  The  detailed  description  of  the  algorithm  follows. 

Lower and upper bounds on $t_{ii}$. The upper bound $t_{max}$ is easily determined by applying list scheduling
to $\mathcal{L}$: $t_{max}$ is the length of the schedule obtained from list scheduling. The lower bound $t_{min}$ is imposed by
resource and dependence constraints.

1. Resource constraints: the final software pipelined loop $\mathcal{L}_{sp}$ contains all the operations in $\mathcal{L}$. Since
$\mathcal{L}_{sp}$ takes $t_{ii}$ cycles to execute, the total number of resource units available in $t_{ii}$ cycles must be greater
than or equal to the resource requirements of all $\mathcal{L}$'s operations put together (this is the critical resource
constraint of Charlesworth adapted to the multiple resource case), that is:

$$\forall\, 1 \le k \le m \quad t_{ii} \cdot R_M(k) \ge \sum_{j=1}^{l} \sum_{x=0}^{d(op_j)-1} R_O(op_j, x)(k)$$

$$t_{ii} \ge \max_{1 \le k \le m} \frac{\sum_{j=1}^{l} \sum_{x=0}^{d(op_j)-1} R_O(op_j, x)(k)}{R_M(k)} = t_{min_R}$$

2. Dependence constraints: cycles in the DDG impose a lower bound on the initiation interval. Let
$C = [op_{j_0}, op_{j_1}, \ldots, op_{j_{c-1}}]$ be a cycle of the DDG. For each edge $e = (op_{j_k}, op_{j_{(k+1) \bmod c}})$ the dependence
constraint imposes $\sigma(op_{j_{(k+1) \bmod c}}) - \sigma(op_{j_k}) \ge \delta(e) - t_{ii} \cdot \lambda(e)$, that is, if we sum the inequalities of all
the edges of $C$ together,

$$t_{ii} \ge \delta(C) / \lambda(C)$$

where $\delta(C)$ and $\lambda(C)$ are the sums of the individual $\delta$s and $\lambda$s of each edge in $C$. Thus

$$t_{ii} \ge \max_{C \in \text{cycles of DDG}} \delta(C) / \lambda(C) = t_{min_D}$$

Taking the maximum of $t_{min_R}$ and $t_{min_D}$ gives $t_{min} = \max(t_{min_R}, t_{min_D})$.
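Both lower bounds can be computed directly on small examples. The following is a sketch under stated assumptions: resource usage is given as per-iteration totals, cycles are enumerated by a naive depth-first search (adequate only for small graphs), and both bounds are rounded up to whole cycles since $t_{ii}$ counts instructions. The function names and the sample data are hypothetical.

```python
from math import ceil

def resource_bound(usage, capacity):
    """usage[r]: units of resource r needed by one iteration; capacity[r]: units of r per cycle."""
    return max(ceil(usage[r] / capacity[r]) for r in usage)

def recurrence_bound(ops, edges):
    """Largest ceil(delta(C) / lambda(C)) over the simple cycles C; edges are (src, dst, lambda, delta)."""
    order = {op: k for k, op in enumerate(ops)}
    best = 0

    def dfs(start, node, delta_sum, lam_sum, visited):
        nonlocal best
        for src, dst, lam, delta in edges:
            if src != node:
                continue
            if dst == start:
                if lam_sum + lam > 0:  # every legal cycle carries at least one iteration
                    best = max(best, ceil((delta_sum + delta) / (lam_sum + lam)))
            elif dst not in visited and order[dst] > order[start]:
                # Only walk through later operations so each cycle is found exactly once.
                dfs(start, dst, delta_sum + delta, lam_sum + lam, visited | {dst})

    for op in ops:
        dfs(op, op, 0, 0, {op})
    return best

edges = [("A", "B", 0, 1), ("B", "C", 1, 1), ("C", "A", 1, 1)]
t_min = max(resource_bound({"mul": 2, "add": 1}, {"mul": 1, "add": 1}),
            recurrence_bound(["A", "B", "C"], edges))
print(t_min)
```

With the three-operation cycle used here ($\delta(C) = 3$, $\lambda(C) = 2$) the recurrence bound rounds up to 2, which matches the resource bound of the two-multiply example.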

Partitioning  the  DDG.  The  DDG  is  partitioned  into  strong  components  by  using  any  of  the  standard 
algorithms  [Tarjan72,Sharir81]. 

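Scheduling a strong component. Let $C$ be a strong component of the DDG, and let $op_{j_1}$, $op_{j_2}$ be
any two operations in $C$. Assume $op_{j_1}$ has been scheduled, that is $\sigma(op_{j_1})$ is known. Adding dependence
constraints along $LP_{op_{j_1} \to op_{j_2}}$ and $LP_{op_{j_2} \to op_{j_1}}$ (the longest path from $op_{j_1}$ to $op_{j_2}$, and from $op_{j_2}$ to $op_{j_1}$)
yields:

$$\sigma(op_{j_1}) - \left( \delta(LP_{op_{j_2} \to op_{j_1}}) - t_{ii} \cdot \lambda(LP_{op_{j_2} \to op_{j_1}}) \right) \ge \sigma(op_{j_2}) \ge \sigma(op_{j_1}) + \delta(LP_{op_{j_1} \to op_{j_2}}) - t_{ii} \cdot \lambda(LP_{op_{j_1} \to op_{j_2}})$$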

thus  the  scheduling  interval  of  the  unscheduled  C  operations  shrinks  every  time  a  new  operation  of  C  is 
scheduled. 
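The scheduling interval can be computed from an all-pairs longest path table in which each edge $e$ costs $\delta(e) - t_{ii} \cdot \lambda(e)$. The sketch below uses a Floyd-Warshall style relaxation; it assumes $t_{ii}$ is large enough that no cycle has positive cost, and the function names and the sample edges are hypothetical.

```python
NEG_INF = float("-inf")

def longest_paths(ops, edges, t_ii):
    """All-pairs longest path costs with edge cost delta - t_ii * lambda (no positive cycles assumed)."""
    lp = {(a, b): NEG_INF for a in ops for b in ops}
    for src, dst, lam, delta in edges:
        lp[(src, dst)] = max(lp[(src, dst)], delta - t_ii * lam)
    for k in ops:
        for a in ops:
            for b in ops:
                if lp[(a, k)] + lp[(k, b)] > lp[(a, b)]:
                    lp[(a, b)] = lp[(a, k)] + lp[(k, b)]
    return lp

def window(lp, scheduled_op, scheduled_time, op):
    """Earliest and latest cycle for op once scheduled_op has been placed at scheduled_time."""
    earliest = scheduled_time + lp[(scheduled_op, op)]
    latest = scheduled_time - lp[(op, scheduled_op)]
    return earliest, latest

edges = [("A", "B", 0, 1), ("B", "C", 1, 1), ("C", "A", 1, 1)]
lp = longest_paths(["A", "B", "C"], edges, t_ii=2)
print(window(lp, "A", 0, "B"), window(lp, "A", 0, "C"))
```

With A placed at cycle 0 and $t_{ii} = 2$, B may go in cycles 1 to 2 and C in cycles 0 to 1; placing a further operation would narrow the remaining windows in the same way.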

$C$ operations are scheduled by applying list scheduling to the subgraph composed of all the edges $e$ of $C$
such that $\lambda(e) = 0$ (such subgraph is acyclic). The priority function used is earliest deadline first. Here is
how the scheduling algorithm proceeds.

1. Let Data_Ready be the set of operations of $C$ whose incoming edges have non zero $\lambda$ value.


Footnote: To determine longest paths the "all-pairs longest path problem" is solved with the cost of an edge $e$ set to $\delta(e) - t_{ii} \cdot \lambda(e)$.

2.  Pick any operation $op_{j_0}$ in Data.Ready and set $\sigma(op_{j_0}) = 0$. Then for each unscheduled $op_{j_k}$ compute $[\sigma_{min}(op_{j_k}); \sigma_{max}(op_{j_k})]$, the time interval in which $op_{j_k}$ has to be scheduled ($\sigma_{min}(op_{j_k})$ and $\sigma_{max}(op_{j_k})$ are computed using the inequalities of the previous paragraph):

$$\sigma_{min}(op_{j_k}) = \sigma(op_{j_0}) + \delta(LP_{op_{j_0} \to op_{j_k}}) - t_{ii} \cdot \lambda(LP_{op_{j_0} \to op_{j_k}})$$
$$\sigma_{max}(op_{j_k}) = \sigma(op_{j_0}) - \big(\delta(LP_{op_{j_k} \to op_{j_0}}) - t_{ii} \cdot \lambda(LP_{op_{j_k} \to op_{j_0}})\big)$$

3.  Put all the operations whose operands have been scheduled in Data.Ready.

4.  Select the operation $op_{j_h} \in$ Data.Ready such that $\sigma_{max}(op_{j_h}) = \min_{op \,\in\, \text{Data.Ready}} \sigma_{max}(op)$, i.e. the operation with the earliest deadline.

5.  Schedule $op_{j_h}$ by scanning all the time slots from $\sigma_{min}(op_{j_h})$ to $\sigma_{max}(op_{j_h})$, selecting the first cycle in which the restricted resource constraints are satisfied. If no such time slot exists, stop: a schedule for $C$ cannot be found with the current $t_{ii}$ ($t_{ii}$ has to be increased). Assuming a time slot is found, set $\sigma(op_{j_h})$ to that time slot.

6.  Recompute $\sigma_{min}$ and $\sigma_{max}$ for any unscheduled operation $op_{j_k}$:

$$\sigma_{min}(op_{j_k}) = \max\big(\sigma_{min}(op_{j_k}),\; \sigma(op_{j_h}) + \delta(LP_{op_{j_h} \to op_{j_k}}) - t_{ii} \cdot \lambda(LP_{op_{j_h} \to op_{j_k}})\big)$$
$$\sigma_{max}(op_{j_k}) = \min\big(\sigma_{max}(op_{j_k}),\; \sigma(op_{j_h}) - (\delta(LP_{op_{j_k} \to op_{j_h}}) - t_{ii} \cdot \lambda(LP_{op_{j_k} \to op_{j_h}}))\big)$$

7.  Repeat steps 3 to 6 until all the operations of $C$ have been scheduled or some operation cannot be scheduled.
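The steps above can be sketched in Python under a simplifying assumption: resources are unlimited, so the first slot of every scheduling interval is always free. The edge representation `(src, dst, delta, lam)` and the function names are hypothetical. Longest paths are computed with a Floyd-Warshall style relaxation using the edge cost $\delta(e) - t_{ii} \cdot \lambda(e)$.

```python
NEG_INF = float("-inf")

def longest_paths(ops, edges, t_ii):
    """All-pairs longest paths with edge cost delta - t_ii * lam."""
    lp = {a: {b: NEG_INF for b in ops} for a in ops}
    for src, dst, delta, lam in edges:
        lp[src][dst] = max(lp[src][dst], delta - t_ii * lam)
    for mid in ops:
        for a in ops:
            for b in ops:
                if lp[a][mid] + lp[mid][b] > lp[a][b]:
                    lp[a][b] = lp[a][mid] + lp[mid][b]
    return lp

def schedule_component(ops, edges, t_ii, first):
    """Schedule one strong component; returns op -> cycle, or None if t_ii is too small.
    Assumes ops really form a strong component and that `first` is in Data.Ready."""
    lp = longest_paths(ops, edges, t_ii)
    if any(lp[a][a] > 0 for a in ops):      # a positive cycle: t_ii < delta(C)/lambda(C)
        return None
    sigma = {first: 0}
    lo = {o: lp[first][o] for o in ops if o != first}      # sigma_min
    hi = {o: -lp[o][first] for o in ops if o != first}     # sigma_max
    preds = {o: [s for s, d, _, lam in edges if d == o and lam == 0] for o in ops}
    while len(sigma) < len(ops):
        ready = [o for o in ops
                 if o not in sigma and all(p in sigma for p in preds[o])]
        op = min(ready, key=lambda o: hi[o])               # earliest deadline first
        if lo[op] > hi[op]:
            return None
        sigma[op] = lo[op]                                 # unlimited resources assumed
        for o in ops:
            if o not in sigma:
                lo[o] = max(lo[o], sigma[op] + lp[op][o])
                hi[o] = min(hi[o], sigma[op] - lp[o][op])
    return sigma

# A three-operation cycle with unit latencies and one loop carried edge.
ops = ["o1", "o3", "o5"]
edges = [("o1", "o3", 1, 0), ("o3", "o5", 1, 0), ("o5", "o1", 1, 1)]
ok = schedule_component(ops, edges, 3, "o1")
too_small = schedule_component(ops, edges, 2, "o1")
```

With a real machine, the line that assigns `sigma[op]` would instead scan the slots from `lo[op]` to `hi[op]` and test the restricted resource constraints in each.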

Condensing strong components. Once a schedule $\sigma_C$ for every strong component $C = (O_C, E_C)$ has been determined, $C$ is condensed into a single node in the DDG and is subsequently treated like any standard operation. The duration function of $C$ is defined to be $d(C) = len(\sigma_C)$; its resource usage function is defined to be the sum of the resource usage functions of the operations that make up $C$, that is:

$$R_O(C, x) = \begin{cases} \sum_{op_j \in O_C} R_O(op_j,\, x - \sigma_C(op_j)) & \text{if } 0 \le x < d(C) \\ 0 & \text{otherwise} \end{cases}$$

The condensed DDG is obtained from the original DDG as follows. Let $e = (op_1, op_2)$ be an edge of the DDG such that $op_1 \in C_1$ and $op_2 \in C_2$, $C_1$ and $C_2$ being two different strong components. Create an edge $e'$ from $C_1$ to $C_2$. Set $\lambda(e') = \lambda(e)$, and $\delta(e') = \delta(e) + (\sigma_{C_1}(op_1) - \sigma_{C_2}(op_2))$. If multiple edges between $C_1$ and $C_2$ arise, select the edge $e'$ with the biggest $\delta(e') - t_{ii} \cdot \lambda(e')$.

Scheduling strong components. The nodes of the condensed DDG (the strong components) are scheduled using list scheduling, with the exception that the resource constraint equation for the restricted SLSP problem is used instead of the one given in section 3.1. The priority function used takes into account the fact that nodes represent several operations: if $C_1$ represents $a$ operations and $C_2$ represents $b$ operations, with $a > b$, $C_1$ is given priority over $C_2$.

As we already mentioned in section 5.1.1, the body of the software pipelined loop can be obtained by packing together every $t_{ii}$-th instruction of $\sigma$. Generating the prelude and postlude is straightforward if we assume that $n$, the number of iterations executed, is greater than or equal to $\lceil len(\sigma)/t_{ii} \rceil$ (that is, we execute the body of the software pipelined loop at least once). To ensure that this is always the case, a test performed at run time decides whether to execute the original loop $\mathcal{L}$ or the software pipelined loop $\mathcal{L}_{sp}$.

Figure 24: (a) Data dependence graph of some loop $\mathcal{L}$; all loop carried dependences (dashed edges) are from one iteration to the next ($\lambda(e) = 1$). (b) The schedule of strong component $C_1$. (c) The loop schedule after scheduling the condensed DDG. (d) Final loop schedule (the initiation interval $t_{ii}$ is 3).
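The packing step described above can be pictured with a small Python sketch. It assumes single-cycle operations and represents the one-iteration schedule $\sigma$ as a hypothetical dictionary from operation name to start cycle; the function names are illustrative only. Each operation lands in row $\sigma(op) \bmod t_{ii}$ of the loop body and is tagged with its stage $\lfloor \sigma(op)/t_{ii} \rfloor$, i.e. how many iterations behind the newest one it is.

```python
from math import ceil

def pack_kernel(sigma, t_ii):
    """Pack together every t_ii-th instruction of sigma.
    Returns t_ii rows; each row lists (op, stage) pairs."""
    rows = [[] for _ in range(t_ii)]
    for op, t in sorted(sigma.items(), key=lambda kv: (kv[1], kv[0])):
        rows[t % t_ii].append((op, t // t_ii))
    return rows

def min_iterations(sigma, t_ii):
    """Smallest trip count for which the pipelined body runs at least once."""
    length = max(sigma.values()) + 1   # len(sigma), assuming single-cycle operations
    return ceil(length / t_ii)

# Hypothetical schedule of one iteration, six cycles long.
sigma = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 5}
kernel = pack_kernel(sigma, 3)
threshold = min_iterations(sigma, 3)
```

The run-time test mentioned in the text would compare the trip count against `threshold` and fall back to the original loop when it is smaller.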

We illustrate Lam's algorithm on a simple example. Assume a machine with unlimited resources (no resource constraints), where all operations take 1 cycle. Consider a loop with a data dependence graph as portrayed in figure 24(a). The DDG has nine strong components $C_1 = \{o1, \ldots, o6\}$, $C_2 = \{o7\}$, ..., $C_9 = \{o14\}$. The cycle $o1 \to o3 \to o5 \to o1$ of $C_1$ imposes $t_{ii} \ge 3$ ($t_{ii} = 3$ is the final initiation interval). The schedule obtained for $C_1$ is given in figure 24(b); after condensing $C_1$ and scheduling the condensed DDG with list scheduling we obtain the schedule of figure 24(c) which, once $C_1$ has been replaced by its schedule $\sigma_{C_1}$, yields the final schedule of figure 24(d).

Lam's algorithm might unnecessarily lengthen the value of $t_{ii}$ if a variable is redefined at the beginning of every iteration. Modulo variable expansion, a variation of the scalar expansion technique of vectorizing compilers [Padua86], is introduced to avoid this problem. The pipelining algorithm is modified as follows:

1.  Pretend that every iteration of $\mathcal{L}$ has a dedicated set of registers (so that no spurious loop carried output dependences arise because of temporaries).

2.  Identify the variables which are redefined at the beginning of every iteration. "Expand those variables into arrays", i.e. for each redefined variable use a different register in each iteration (this is the standard scalar expansion technique).

3.  The algorithm then proceeds as previously stated. The resulting schedule $\sigma$ is used to determine the actual number of registers that must be allocated to each redefined variable (and all the temporaries): if the lifetime of a register variable is $lt$ cycles, then $\lceil lt/t_{ii} \rceil$ different values must be kept alive concurrently in that many registers. If a variable is allocated $r$ registers, then the $i$-th iteration will use the $(i \bmod r)$-th register allocated to the variable.

4.  Without modulo variable expansion the number of VLIW instructions in the body of $\mathcal{L}_{sp}$ is $t_{ii}$. When modulo variable expansion is used, iterations using different sets of registers must execute different sequences of instructions. Let $r_{v_k}$ be the number of registers allocated to variable $v_k$, and let $u$ be the least common multiple of the $r_{v_k}$s. The body of the final software pipelined loop is obtained by packing together every $t_{ii}$-th instruction of $\sigma$ and then replicating $u$ times the $t_{ii}$ instructions obtained by the packing process (register variables have to be appropriately renamed during the replication). The final body of the software pipelined loop contains $u \cdot t_{ii}$ instructions.
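The register counts and the replication factor from steps 3 and 4 reduce to a few lines of arithmetic. The following Python sketch assumes variable lifetimes are already known from the schedule; the variable names and lifetimes in the example are hypothetical.

```python
from functools import reduce
from math import gcd

def registers_needed(lifetime, t_ii):
    """ceil(lifetime / t_ii): values of one variable that are alive at the same time."""
    return -(-lifetime // t_ii)

def expansion_plan(lifetimes, t_ii):
    """Returns (registers per variable, replication factor u, body length u * t_ii)."""
    regs = {v: registers_needed(lt, t_ii) for v, lt in lifetimes.items()}
    u = reduce(lambda a, b: a * b // gcd(a, b), regs.values(), 1)   # least common multiple
    return regs, u, u * t_ii

def register_for(iteration, r):
    """Index of the register used by a given iteration for a variable with r registers."""
    return iteration % r

# Hypothetical lifetimes (in cycles) with an initiation interval of 3.
regs, u, body_len = expansion_plan({"x": 7, "y": 4, "t": 2}, 3)
```

In this example `x` needs 3 registers, `y` needs 2 and `t` needs 1, so the packed instructions are replicated 6 times and the body holds 18 instructions.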

Lam has also presented a simple hierarchical method to handle loops that contain conditional constructs but have a reducible flow graph. The THEN and ELSE branches of a conditional statement are first independently compacted using list scheduling, and then padded to the same length. Once this is done the entire conditional statement is reduced to a single node, much in the same way a strong component is condensed. This node can be treated within the surrounding construct (the loop or another conditional statement) like any other simple operation. At code emission time any operation scheduled concurrently with the conditional node is duplicated in both branches (note that this might lead to code explosion if conditional branches are scheduled concurrently). This method is very simple, and its main intent is to keep conditionals from blocking loop pipelining. It remains, however, that the presence of conditionals inside a loop inhibits the effectiveness of software pipelining.

Lam's  software  pipelining  technique  has  been  extensively  tested  on  signal  processing  and  scientific  code 
for  the  Warp  machine  (a  linear  systolic  array  composed  of  10  VLIW  cells  [Annaratone87]).  To  study  the  gain 
in  performance  obtained  from  software  pipelining,  Lam  measured  the  speedup  of  pipelined  versus  compacted 
loops:  the  average  factor  of  increase  in  speed  was  three. 

In the infinite resource case (no resource constraints) Lam's algorithm yields an optimal solution to the restricted SLSP problem. This optimal solution is never "too far" from the time optimum. We now formalize the "never too far" notion. Lam's algorithm employs the data dependence cycles of $\mathcal{L}$ to establish a lower bound on $t_{ii}$. Strictly speaking, however, cycles impose a scheduling constraint only on the operations of the cycle. Let $C = [op_{j_0}, op_{j_1}, \ldots, op_{j_{c-1}}]$ be a data dependence cycle of $\mathcal{L}$. For each edge $e = (op_{j_r}, op_{j_{(r+1) \bmod c}})$ the general dependence constraint imposes:

$$\sigma_{i+\lambda(e)}(op_{j_{(r+1) \bmod c}}) - \sigma_i(op_{j_r}) \;\ge\; \delta(e) - t_{ii}(i, i + \lambda(e))$$

that is, if we sum all the $c$ inequalities together:

$$\forall\, 1 \le i \le n - \lambda(C) \quad \forall\, op_{j_r} \in C: \quad \sigma_{i+\lambda(C)}(op_{j_r}) - \sigma_i(op_{j_r}) \;\ge\; \delta(C) - t_{ii}(i, i + \lambda(C))$$

where $\delta(C)$ and $\lambda(C)$ are the sums of the individual $\delta$s and $\lambda$s of the edges in $C$. Lam's algorithm satisfies such constraints by scheduling consecutive copies of $op_{j_r}$ at constant intervals of $t_{ii} = \lceil \delta(C)/\lambda(C) \rceil$ cycles. If $\lambda(C)$ divides $\delta(C)$ this yields a time optimal solution; otherwise it does not, and the solution obtained is off the time optimum by:

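$$n \cdot \left( \left\lceil \frac{\delta(C)}{\lambda(C)} \right\rceil - \frac{\delta(C)}{\lambda(C)} \right)$$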

where  n  is  the  overall  number  of  iterations  executed. 


5.1.3      An  Optimal  Algorithm  for  the  Restricted  SLSP  Problem  Using  Special  Hardware 

The polycyclic architecture, developed by Rau and Glaeser [Rau81,Rau82a,Rau82b], can be employed in conjunction with Lam's algorithm to obtain optimal solutions for the restricted SLSP problem in the case of loops with no carried dependences. The key idea of the architecture is to ensure that explicitly scheduled resources perform only those operations that are explicit in the loop data dependence graph. This is achieved by fully interconnecting functional units through a crossbar and placing a special dedicated buffer, called a delay element, between the output and the inputs of every functional unit (see figure 25).

Figure 25: The output of every functional unit is connected to the inputs of every functional unit through a delay element.

A delay element is partly a queue and partly a register file, i.e. data can be read from any location in the delay element, but can only be written at its end. Also, a read operation has the option of deleting the data read from the delay element by shifting down all the values above it.
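The behaviour of a delay element can be modelled with a short Python class. This is only a software sketch of the semantics described above; the class name, the `capacity` parameter and the overflow behaviour are assumptions, not details of the actual hardware.

```python
class DelayElement:
    """Part queue, part register file: write only at the end, read anywhere."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed fixed size; not specified in the text
        self.slots = []

    def write(self, value):
        """Writes are only possible at the end of the delay element."""
        if len(self.slots) >= self.capacity:
            raise OverflowError("delay element is full")
        self.slots.append(value)

    def read(self, index, delete=False):
        """Read any location; optionally delete it, shifting down the values above it."""
        value = self.slots[index]
        if delete:
            del self.slots[index]
        return value

# Three results arrive from a functional unit, then two are consumed.
de = DelayElement(capacity=4)
for v in (10, 20, 30):
    de.write(v)
peeked = de.read(1)                  # non-destructive read
consumed = de.read(0, delete=True)   # destructive read, later values shift down
```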

Operations executed on the polycyclic architecture do not reference any registers; they just specify the functional units on which they execute, the location of the operands in the input delay elements, and the delay elements in which the results should be stored. No register allocation is therefore needed, and the buffered crossbar has the effect of implicitly implementing Lam's modulo variable expansion optimization. As we already mentioned, optimal solutions for the restricted SLSP problem can be obtained by adapting Lam's algorithm to the polycyclic architecture. Note that optimal solutions can be obtained only for loops with no carried dependences; for other types of loops the polycyclic architecture does not help in finding an optimal solution in polynomial time. The major drawback of the polycyclic approach is the cost of the interconnect. The polycyclic architecture has been the basis for a commercial machine, the Cydra 5 [Rau89], which has been recently built.

Footnote: A similar architecture has been recently proposed by Jegou and Seznec [Jegou86].

5.1.4      An Optimal Algorithm for the General SLSP Problem with Infinite Resources

The general SLSP problem is NP-complete; however, polynomial time algorithms can be found if one is compiling for a machine with unlimited resources (i.e. no resource constraints). One such algorithm, greedy scheduling, has been devised by Aiken and Nicolau [Aiken88b]. Greedy scheduling can handle only single cycle operations; in what follows we present a modified version of the algorithm that handles multicycle operations.

Let $\mathcal{L}$ be the input loop. Greedy scheduling works by scheduling $\mathcal{L}$'s operations in a greedy fashion, that is, as early as dependences permit. This is achieved by unrolling the first $p_{\mathcal{L}}$ iterations of $\mathcal{L}$ (for some $p_{\mathcal{L}} \in \mathbb{N}$) and applying list scheduling to the unrolled iterations. For a large enough $p_{\mathcal{L}}$ the greedy schedule should contain a repeating pattern. The amount of unrolling $p_{\mathcal{L}}$ necessary to detect such a pattern is determined by the cycles in $\mathcal{L}$'s DDG. Let $C$ be one such data dependence cycle; we define $slope(C) = \delta(C)/\lambda(C)$ (where $\delta(C)$ and $\lambda(C)$ are respectively the sums of the $\delta(e)$s and $\lambda(e)$s over all the edges $e$ in $C$). For each operation $op \in \mathcal{L}$ we define $C[op]$ to be the data dependence cycle of maximum slope in which $op$ is involved and $T(op) = 2 \cdot \lambda(C[op]) \cdot \delta(C[op]) + 3 \cdot \lambda(C[op])$. Aiken and Nicolau have proved the following:

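Theorem: every greedy schedule of a loop $\mathcal{L}$ comprising $n$ iterations satisfies the following property:

$$\forall\, op \in \mathcal{L} \quad \forall\, T(op) < i \le n: \quad t_{ii}(i, i + \lambda(C[op])) + \sigma_{i+\lambda(C[op])}(op) - \sigma_i(op) = \delta(C[op])$$

that is, after scheduling $T(op) + 1 + \lambda(C[op])$ iterations, any occurrence of operation $op$ is scheduled exactly $\delta(C[op])$ time steps after the occurrence of $op$ $\lambda(C[op])$ iterations before.

Thus we just need to set $p_{\mathcal{L}} = (\max_{op \in \mathcal{L}} T(op)) + 1$. This makes greedy scheduling a polynomial time algorithm, since:

$$\forall\, op \in \mathcal{L}: \quad \delta(C[op]) \le l \cdot \delta_{max} \;\text{ and }\; \lambda(C[op]) \le l \cdot \lambda_{max} \quad \Longrightarrow \quad p_{\mathcal{L}} = O(l^2)$$

($l$ is the number of operations in $\mathcal{L}$, and $\delta_{max}$, $\lambda_{max}$ are the maximum values of $\delta$ and $\lambda$ over $\mathcal{L}$'s DDG).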
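The unrolling bound can be computed directly from the definitions of $slope(C)$ and $T(op)$. The Python sketch below assumes the data dependence cycles have already been enumerated and are given as hypothetical triples `(operations, delta_C, lambda_C)`; operations that lie on no cycle are simply left out, which is an assumption of this sketch.

```python
from fractions import Fraction

def unrolling_bound(cycles):
    """cycles: list of (set of ops on the cycle, delta(C), lambda(C)).
    Returns (T, p) where T maps each op to T(op) and p = max T(op) + 1."""
    best = {}   # op -> (slope, delta, lam) of the maximum-slope cycle through op
    for ops, delta, lam in cycles:
        slope = Fraction(delta, lam)
        for op in ops:
            if op not in best or slope > best[op][0]:
                best[op] = (slope, delta, lam)
    T = {op: 2 * lam * delta + 3 * lam for op, (_, delta, lam) in best.items()}
    return T, max(T.values()) + 1

# Two hypothetical cycles sharing operation o1.
cycles = [({"o1", "o3", "o5"}, 3, 1),   # slope 3
          ({"o1", "o2"}, 2, 2)]         # slope 1
T, p = unrolling_bound(cycles)
```

Operation `o1` lies on both cycles and takes its bound from the steeper one, so `T["o1"]` is 9, while `o2` only lies on the second cycle and gets 14; the sketch therefore unrolls 15 iterations.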

Unfortunately the pattern produced by such an algorithm is still too irregular to be practically exploited by a VLIW machine. Consider for instance the loop whose DDG is given in figure 24(a). The greedy schedule obtained for such a loop (assuming single cycle operations) is given in figure 26. The schedules have some undesirable "gaps" that prevent generation of efficient VLIW code (the loop has to be fully unrolled).

To remedy this problem Aiken and Nicolau have introduced the notion of a region. Informally, a region $\sigma_i^{-1}(t_1..t_2)$ of $\sigma_i$ is a time slice of $\sigma_i$ containing no "gaps", that is:

$$\sigma_i^{-1}(t_1..t_2) = [\sigma_i^{-1}(t_1); \ldots; \sigma_i^{-1}(t_2)] \quad \text{where} \quad \sigma_i^{-1}(t) = \{op \in \mathcal{L} \mid \sigma_i(op) = t\}$$

A maximal region is a region which is of maximal size. In the above example a schedule $\sigma_i$ has two maximal regions: [{o7};{o9};{o10};{o12,o13};{o11,o14};{o8}] and [{o2};{o1,o4};{o3,o6};{o5}]. Note that between any two maximal regions of $\sigma_i$ there must exist at least one cycle (a "gap") where no operation of $\mathcal{L}$ is scheduled. These gaps play a key role in generating more concise software pipelined code.
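Splitting a schedule into maximal regions amounts to grouping operations by time step and cutting wherever a time step is empty. The Python sketch below assumes single-cycle operations; the concrete time steps in the example are hypothetical values chosen only so that the result matches the two regions listed above.

```python
def maximal_regions(sigma):
    """sigma: op -> time step. Returns a list of regions, each a list of sets of
    operations, one set per consecutive time step; regions are separated by gaps."""
    by_time = {}
    for op, t in sigma.items():
        by_time.setdefault(t, set()).add(op)
    regions, current, previous = [], [], None
    for t in sorted(by_time):
        if previous is not None and t != previous + 1:   # at least one empty cycle
            regions.append(current)
            current = []
        current.append(by_time[t])
        previous = t
    if current:
        regions.append(current)
    return regions

# Hypothetical time steps; cycle 6 is a gap.
sigma_i = {"o7": 0, "o9": 1, "o10": 2, "o12": 3, "o13": 3, "o11": 4, "o14": 4,
           "o8": 5, "o2": 7, "o1": 8, "o4": 8, "o3": 9, "o6": 9, "o5": 10}
regions = maximal_regions(sigma_i)
```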

Aiken and Nicolau have presented a condition under which the gaps in a schedule $\sigma_{i+k}$ can be shrunk to the size of the gaps in schedule $\sigma_i$ (provided $t_{ii}(i, i + k)$ is increased) without affecting the time optimality of the final code.

Figure 26: Schedules for the loop of figure 24 produced by greedy scheduling. All the initiation intervals are equal to zero ($t_{ii}(i, i + 1) = 0$); the above table is therefore identical to the time/iteration diagram.


Theorem: let $i_0$ and $i_0 + \Delta$ be the first two greedily scheduled iterations of $\mathcal{L}$ such that $\sigma_{i_0}$ and $\sigma_{i_0+\Delta}$ have the same maximal regions and the inter-region gaps in $\sigma_{i_0+\Delta}$ are as large or larger than in $\sigma_{i_0}$. If there is no chain of dependences from a region in $\sigma_{i_0}$ (except the last) to the last region of $\sigma_{i_0+\Delta}$, then a more "regular" solution (denoted with ') to the general SLSP problem with unlimited resources can be obtained from greedy scheduling as follows. Let $t'_{ii} = t_{ii}(i_0, i_0 + \Delta) + len(\sigma_{i_0+\Delta}) - len(\sigma_{i_0})$, then:

$$\forall\, 1 \le i < i_0 + \Delta: \quad \sigma'_i = \sigma_i, \qquad t'_{ii}(1, i) = t_{ii}(1, i)$$

$$\forall\, i_0 + \Delta \le i \le n: \quad \sigma'_i = \sigma_{i_0 + (i - i_0) \bmod \Delta}, \qquad t'_{ii}(i_0, i) = \left\lfloor \frac{i - i_0}{\Delta} \right\rfloor \cdot t'_{ii} + t_{ii}(i_0,\, i_0 + (i - i_0) \bmod \Delta)$$

that is, we can reduce the inter-region gaps of $\sigma_{i_0+h+k\Delta}$ to the size of the corresponding gaps in $\sigma_{i_0+h}$ ($0 \le h < \Delta$, $1 \le k$) and increase $t_{ii}(i_0, i_0 + h + k \cdot \Delta)$ by the amount the inter-region gaps of $\sigma_{i_0+\Delta}, \sigma_{i_0+\Delta+1}, \ldots, \sigma_{i_0+h+k\Delta}$ have been shrunk, without affecting the time optimality of the final result.

Compact code can be generated by considering the schedules and initiation intervals of iterations $i_0, i_0 + 1, \ldots, i_0 + \Delta - 1$ as a single rigid schedule $\sigma'$ and packing together (while adjusting array indices) every $t'_{ii}$-th instruction of $\sigma'$ (iterations 1 through $i_0 - 1$ form the prelude of the software pipeline).

The only problem with this approach is that it no longer guarantees a polynomial time algorithm. In fact, let lcm denote the least common multiple; then $\Delta \ge \mathrm{lcm}_{op \in \mathcal{L}}(\lambda(C[op]))$, which can potentially be exponential in $l$ (the number of operations in $\mathcal{L}$).

5.2      Complex Software Pipelining Techniques

In this section we survey two pipelining algorithms that are able to process loops with arbitrary conditional jumps. We first review pipeline scheduling, developed by Ebcioglu [Ebcioglu88,Ebcioglu87] to pipeline sequential-natured code (section 5.2.1). We then survey perfect pipelining [Aiken88c], a software pipelining technique based on Nicolau's percolation scheduling (section 5.2.2).

5.2.1      Pipeline Scheduling

Pipeline scheduling [Ebcioglu87] is a software pipelining technique specifically designed for sequential, non-numerical code: loops can have arbitrary bounds and contain if-then-else as well as exit statements; however, arrays are treated as single indivisible variables. The target machine assumed by pipeline scheduling is an abstract VLIW which extends the PS model (Ebcioglu is currently building such a machine [Ebcioglu88]).

Abstract machine model. Ebcioglu's abstract machine model $\mathcal{M}$ extends Nicolau's PS model (see section 4.4.1) by allowing operations to be performed conditionally, that is, the $CJ_I$ part of an instruction $I$ contains operations as well as conditional tests. Formally an instruction $I$ is of the form:

$L_I$: [$Op_I$, $CJ_I$, $Next_I$]

where $L_I$ and $Next_I$ are instruction labels, $Op_I$ is a set of register operations, and $CJ_I$ is either empty or of the form:

{ IF c THEN $I'$ ELSE $I''$ }

where c is a condition code register and $I'$ and $I''$ are two unlabeled instructions. Figure 27(a) gives an example. As for PS, instructions can be represented by a binary tree, except that in this case every tree edge may be labeled with a set of operations (see figure 27(b)).

The execution of an instruction $I$ takes a single cycle and proceeds as follows. Let $T_I$ be the tree representation of $I$.

1.  Given the current (i.e. old) values of the condition registers, the (unique) execution path in $T_I$ is determined by following the right edge of a conditional jump if the condition is true, and the left edge if the condition is false. For instance, assuming c1 and c2 are both true, the path selected in the instruction of figure 27(a) would go from $L_1$ to $L_3$.

2.  The operations labeling the execution path are executed in parallel by performing all the reads first, and then all the writes. If a variable is updated by two different operations, the one which is closer to the leaf has priority.

3.  The branch taken is given by the label of the leaf of the execution path.
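The three execution steps can be mimicked by a small Python interpreter. The encoding of an instruction as a nested tuple `(ops, cj, next_label)` is an assumption of this sketch, and the example instruction is only loosely modeled on the style of figure 27(a); its operations and labels are hypothetical.

```python
def execute(instr, state):
    """instr = (ops, cj, next_label).
    ops: list of (destination, function of the OLD state);
    cj: None or (condition on the OLD state, then_instr, else_instr).
    Updates state in place and returns the label of the leaf reached."""
    old = dict(state)            # step 1: conditions and reads see the old values
    writes = []
    node = instr
    while True:
        ops, cj, next_label = node
        writes.extend((dest, fn(old)) for dest, fn in ops)   # step 2: all reads first
        if cj is None:
            break
        cond, then_instr, else_instr = cj
        node = then_instr if cond(old) else else_instr
    for dest, value in writes:   # then all writes, root to leaf, so the leaf side wins
        state[dest] = value
    return next_label            # step 3: branch to the label of the leaf

instr = (
    [("a", lambda s: s["b"] + s["c"]), ("i", lambda s: s["i"] + 1)],
    (lambda s: s["c1"],
     ([("d", lambda s: 0)],
      (lambda s: not s["c2"],
       ([], None, "L5"),
       ([("j", lambda s: s["j"] + 1)], None, "L3")),
      None),
     ([("d", lambda s: s["i"]), ("b", lambda s: 0)], None, "L7")),
    None,
)

state1 = {"a": 0, "b": 2, "c": 3, "d": 9, "i": 4, "j": 7, "c1": True, "c2": True}
label1 = execute(instr, state1)
state2 = {"a": 0, "b": 2, "c": 3, "d": 9, "i": 4, "j": 7, "c1": False, "c2": True}
label2 = execute(instr, state2)

# Two operations on the same path write x; the one closer to the leaf has priority.
clash = ([("x", lambda s: 1)], (lambda s: True, ([("x", lambda s: 2)], None, "L9"),
                                ([], None, "L8")), None)
state3 = {"x": 0}
label3 = execute(clash, state3)
```

Because every function receives the snapshot `old`, the ELSE branch in the second run stores the old value of `i` in `d` even though `i` is incremented in the same instruction.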


Footnote: Arrays are treated as single indivisible variables, that is, operations that assign to a(1) and to a(2) are assumed to access the same variable a.

Footnote: In what follows, instructions such that $Op_I = CJ_I = \phi$ are identified with their $Next_I$ label (for instance [$\phi$, $\phi$, $L_2$] is identified with $L_2$).

A program for the abstract machine model $\mathcal{M}$ can be represented by a flow graph where each vertex contains a single $\mathcal{M}$ instruction. This representation is similar to the one adopted for PS, and is likewise termed parallel program graph. As for PS, a vertex $v$ of the parallel program graph may have an arbitrary number of outgoing edges, according to the structure of the instruction represented by $v$.


L1: [{a:=b+c; b:=x(i); i:=i+1; c5:=(i<n)},
     {IF c1 THEN [{d:=0}, {IF (NOT c2) THEN [φ, φ, L5]
                                       ELSE [{h:=y(i); j:=j+1}, φ, L3] }, -]
            ELSE [{d:=x(i); b:=0}, φ, L7] }, -]

(a)

Figure 27: (a) An instruction of Ebcioglu's abstract machine model. (b) Tree representation of the instruction.

Pipeline scheduling. Pipeline scheduling is a software pipelining technique explicitly developed for the previous machine model $\mathcal{M}$. The algorithm exploits the ability to execute operations conditionally by performing, as early as possible, instructions of successive iterations on every execution path. Pipeline scheduling has been designed for single entry loops; however, in the case of multiple entries an entry instruction is given priority and the algorithm pipelines only those execution paths that originate at the selected entry node. The algorithm is divided in two phases:

phase 1: the input loop is compacted using the greedy-compact transformation of PS, that is, operations are pushed up the parallel program graph of the loop body as high up as possible (see section 4.4.2).

Footnote: Actually any global compaction transformation is acceptable for phase 1.

phase 2: Let $\mathcal{L}$ be the resulting loop and let $I_1, I_2, \ldots, I_l$ be the VLIW instructions composing $\mathcal{L}$ (i.e. the nodes of the compacted parallel program graph). Let $I_1$ be the selected entry instruction of $\mathcal{L}$ and let $i$ be any arbitrary iteration; then instruction $I_1(i+1)$ (the first instruction of iteration $i+1$) is moved up all the execution paths starting at $I_1(i)$ and ending at $I_1(i+1)$, as much as possible. This is accomplished by trying to schedule $I_1(i+1)$ concurrently with each one of the branches of the instructions directly reachable from $I_1(i)$. Due to resource and/or data dependence constraints it may not be possible to pack $I_1(i+1)$ with some of the branches of the direct successors of $I_1(i)$. In this case we try to schedule $I_1(i+1)$ concurrently with each one of the branches of the instructions that are pointed to by the branches for which the initial concurrent scheduling failed. This last step is repeated until $I_1(i+1)$ is scheduled on each execution path from $I_1(i)$ to $I_1(i+1)$ as high up as resource and data dependence constraints permit (note that on some execution paths it may not be possible to schedule $I_1(i+1)$ at all). After $I_1(i+1)$ has been moved up, the children (i.e. direct successors) of $I_1(i+1)$ are pushed up in the same manner. However, if one of these children, say $I_r(i+1)$, is such that $I_1(i+1)$ has been scheduled concurrently with one or more of $I_r(i)$'s branches, then we move up $I_r(i+1)$ with $I_1(i+2)$ attached at the corresponding branches, instead of just moving $I_r(i+1)$ alone.

Given that the machine model $\mathcal{M}$ allows only register operations, all loop carried dependences go from one iteration to the next (i.e. if for some edge $e$ of $\mathcal{L}$'s DDG, $\lambda(e) > 0$ then $\lambda(e) = 1$). This is what allows $I_1(i+2)$ to be moved up simultaneously with $I_r(i+1)$ and still guarantee that $I_r(i+1)$ will be pushed up all the execution paths of iteration $i$ as much as possible, because there are no dependences between operations in iteration $i$ and $I_1(i+2)$. The ability to move $I_1(i+2)$ up concurrently with $I_r(i+1)$ not only saves work but also, and especially, guarantees that a repetitive pattern will be generated (that is, we can obtain a compact pipelined loop schedule).

After all the children of $I_1(i+1)$ have been moved up all the execution paths of iteration $i$ as much as possible, we focus on the grandchildren of $I_1(i+1)$ (i.e. the direct successors of the direct successors of $I_1(i+1)$). Let $I_q(i+1)$ be one such grandchild; $I_q(i+1)$ is pushed up all the execution paths of iteration $i+1$ as much as possible. However, if a child of $I_1(i+1)$, say $I_p(i+1)$, has been scheduled concurrently with one or more of $I_q(i)$'s branches, then we push up $I_q(i+1)$ with $I_p(i+2)$ attached at the corresponding branches. Note that if $I_1(i+1)$ has been scheduled concurrently with one or more of $I_p(i)$'s branches then we push up $I_q(i+1)$, $I_p(i+2)$, $I_1(i+3)$, where $I_p(i+2)$ and $I_1(i+3)$ are attached to the corresponding branches of (respectively) $I_q(i+1)$ and $I_p(i+2)$.

After all the grandchildren of $I_1(i+1)$ have been moved up, we move up the children of the grandchildren of $I_1(i+1)$, and so on. This process is repeated until all the instructions of iteration $i+1$ have been moved up iteration $i$ as much as possible (note that by moving up instructions of iteration $i+1$ we simultaneously pull up instructions of iterations $i+2$, $i+3$, ... and effectively pipeline iterations $i$, $i+1$, $i+2$, $i+3$, .... Note also that arbitrary execution gaps may happen in iterations $i+1$, $i+2$, ... while they are concurrently executing with iteration $i$ (flexible pipeline)).

Footnote: Note that a loop $\mathcal{L}_0$ with arbitrary carried dependences can be converted into one where carried dependences go from one iteration to the next, by unrolling $\mathcal{L}_0$ ($\max_{e \,\in\, \text{DDG of } \mathcal{L}_0} \lambda(e)$) times.

Footnote: While the previous software pipelining algorithms produce rigid pipelines (i.e. the execution time of each iteration is predictable at compile time), pipeline scheduling creates flexible pipelines (i.e. the execution of an iteration may contain variable time gaps not predictable at compile time).

A more detailed description of the algorithm follows. Let $\mathcal{F}$ be the parallel program graph of the initial compacted loop, and let $\mathcal{F}'$ be the parallel program graph of the software pipelined loop. Pipeline scheduling proceeds as follows.

1.  Copy $n_1$, the selected entry node of $\mathcal{F}$, to $\mathcal{F}'$ and name it $[n_1]$.

2.  For each direct successor $n_c$ of $n_1$, different from $n_1$ and belonging to $\mathcal{F}$ (i.e. it is not an exit node), enqueue $(n_c, [n_1])$ in a queue $Q$ if not already there.

3.  Let $(n_p, [n_{j_1}, \ldots, n_{j_k}])$ be the element at the front of $Q$, where $n_p$ is a node of $\mathcal{F}$ and $[n_{j_1}, \ldots, n_{j_k}]$ a node of $\mathcal{F}'$. Copy $n_p$ from $\mathcal{F}$ to $\mathcal{F}'$, name the new node $[n_p, n_{j_1}, \ldots, n_{j_k}]$, then

    for each branch $b$ of $n_p$ which does not branch out of the loop do

    (a) $b$ branches back to $n_1$:

        if $[n_{j_1}, \ldots, n_{j_k}]$ can execute concurrently with branch $b$
        (i.e. resource and data dependence constraints are not violated) then
            append a copy of $[n_{j_1}, \ldots, n_{j_k}]$ to the end of $b$
        else
            set $b$ to branch to $[n_{j_1}, \ldots, n_{j_k}]$

    (b) $b$ branches to $n_r \ne n_1$:

        if $[n_{j_1}, \ldots, n_{j_k}]$ can execute concurrently with branch $b$ then
            make a copy of $[n_{j_1}, \ldots, n_{j_k}]$, change the end of each of its branches
            to point from $[n_{e_1}, \ldots, n_{e_m}]$ to $[n_r, n_{e_1}, \ldots, n_{e_m}]$ and
            append the modified $[n_{j_1}, \ldots, n_{j_k}]$ to the end of $b$.
            Enqueue $(n_r, [n_{e_1}, \ldots, n_{e_m}])$ if not already in $Q$ or in $\mathcal{F}'$
        else
            set $b$ to branch to $[n_r, n_{j_1}, \ldots, n_{j_k}]$ and
            enqueue $(n_r, [n_{j_1}, \ldots, n_{j_k}])$ if not already in $Q$ or $\mathcal{F}'$

4.  Dequeue $(n_p, [n_{j_1}, \ldots, n_{j_k}])$ and repeat steps 3 and 4 until $Q$ is empty.

The prelude and postlude of the pipelined loop are automatically generated by the algorithm, if necessary.

Ebcioglu has proven [Ebcioglu87] that the above algorithm terminates, and that the resulting parallel program graph $\mathcal{F}'$ is semantically equivalent to the input parallel program graph $\mathcal{F}$. Pipeline scheduling has a potentially exponential code increase; in practice, however, code explosion would be prevented by resource shortages.

Two remarks on step 3 of the algorithm are in order. First, the step generates node [n_p, n_j1, ..., n_jk] of F' by
trying to append the instruction in [n_j1, ..., n_jk] to each of the branches of the instruction contained in n_p
(in this way the current and the k future iterations i, i+1, i+2, ..., i+k advance by one instruction,
respectively n_p, n_j1, n_j2, ..., n_jk). If there is a branch b to which appending is not possible (due to data
dependence or resource conflicts), then the computation along branch b of instruction n_p of iteration i will
execute alone in F', that is the current iteration i will advance by one instruction (n_p) while the k future
iterations wait. Second, when [n_j1, ..., n_jk] updates a variable v whose old value is used in an instruction
reachable from branch b, [n_j1, ..., n_jk] cannot be appended to b. On the fly renaming of v allows
[n_j1, ..., n_jk] to be appended to b, and is performed whenever possible by pipeline scheduling.

We demonstrate pipeline scheduling on an example (taken from [Ebcioglu87]). Consider the C program of
figure 28(a). The code copies a string a, composed of upper and lower case letters, to a string b. In the copying
process lower case letters are converted to upper case, and upper case letters are preceded by a backslash.
This is the transformation that UNIX performs when communicating with upper case only terminals. The
intermediate code for the program is given in figure 28(b). The code obtained after compacting the program
with PS (and performing on the fly renaming for c) is given in figure 28(c). The tree representation of instructions
L1, L2, L3 and L4 is given in figure 29. Figure 30 demonstrates pipeline scheduling on the loop of figure 28.
The instructions generated are listed in the order in which they were added to F'. The steady state of the
pipelined loop executes at 1-2 cycles per iteration, depending on whether the character copied during the
current iteration is lower or upper case. The loops produced by pipeline scheduling execute operations as
soon as possible, regardless of whether the iterations to which such operations belong will later have to wait
because of dependence conflicts (flexible pipeline). Let us illustrate this point by examining the behavior of
the pipelined loop on the input a="AbC". The output will be b="\AB\C". The following table shows the
evolution of the software pipeline: the leftmost column gives the cycle in which operations are executed, and
the next six columns tabulate the operations that are executed by each iteration. Iterations 4 through 6 do
useless work (we assume that it is safe to access a past its end). The rightmost column gives the instruction
actually executed by the pipelined loop. As one can see from the table, iterations 2 and 3 wait while iteration
1 completes.


cycle |           iteration            | instruction executed
      |  1    2    3    4    5    6    |
------+--------------------------------+---------------------
  1   |  L1                            | [L1]
  2   |  L2   L1                       | [L2,L1]
  3   |  L3   L2   L1                  | [L3,L2,L1]
  4   |  L4   -    -                   | [L4,L3,L2,L1]
  5   |       L3   L2   L1             | [L3,L2,L1]
  6   |            L3   L2   L1        | [L3,L2,L1]
  7   |            L4   -    -         | [L4,L3,L2,L1]
  8   |                 L3   L2   L1   | [L3,L2,L1]
  9   |                                | E1
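The evolution shown in the table can be reproduced with a small Python model. It is a simplified sketch that hard-codes the behavior described in the text rather than deriving it from dependence analysis: an upper case character needs L1 through L4, any other character needs L1 through L3, and L4 executes alone while younger iterations wait. The function name `pipeline_trace` and the row encoding are assumptions made for illustration.

```python
def pipeline_trace(a):
    """Return one dict {iteration: instruction} per cycle until the exit test."""
    chars = a + "\0"          # iteration n reads chars[n - 1]; "\0" ends the string
    in_flight = []            # [iteration, next instruction number], oldest first
    next_iteration = 1
    rows = []
    while True:
        if in_flight and in_flight[0][1] == 4:
            # Node [L4,L3,L2,L1]: only L4 executes, younger iterations wait.
            oldest = in_flight.pop(0)
            rows.append({oldest[0]: "L4"})
            continue
        # Otherwise every iteration in flight advances and a new one starts.
        in_flight.append([next_iteration, 1])
        next_iteration += 1
        rows.append({it: f"L{stage}" for it, stage in in_flight})
        oldest_iteration, oldest_stage = in_flight[0]
        for entry in in_flight:
            entry[1] += 1
        if oldest_stage == 3:
            c = chars[oldest_iteration - 1]
            if c == "\0":
                return rows               # L3 branches to the exit
            if not ("A" <= c <= "Z"):
                in_flight.pop(0)          # iteration finished after L3


rows = pipeline_trace("AbC")
```

For the input "AbC" the model yields eight rows, matching cycles 1 through 8 of the table; waiting iterations are simply absent from the rows for cycles 4 and 7.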

Hardware implementation of the abstract machine. Pipeline scheduling heavily depends on the ability
to perform several conditional computations simultaneously. Ebcioglu is currently designing a machine with
such capabilities [Ebcioglu88] (figure 31 gives the block diagram of the machine data path). The machine is
programmed via an 1800-bit instruction, and has N_alu (16) 32-bit combinatorial ALUs (numbers
in parentheses give the values for the current prototype) capable of executing integer as well as floating point
operations. The results of comparison operations are sent to 2-bit condition code registers (one per ALU),
while other operations send their results to a multiported register file containing N_reg (128) 33-bit registers
(the multiported register file has N_alu input ports connected to the N_alu ALU outputs and 2N_alu
output ports connected to the 2N_alu ALU inputs). Operations can be interruptible or non-interruptible.
Non-interruptible operations are used when unsafe code motions are performed (for instance when b:=c/a
is moved past if (a ≠ 0.0) then ...). If a non-interruptible operation yields an invalid result, the invalid bit
of the destination register (2nd bit for a condition code register and 33rd bit for the other registers) is set
and no exception is raised. If one of the arguments of an interruptible operation is invalid, an exception occurs.

Footnote on the choice of combinatorial ALUs: This choice is motivated by the faster execution time of
combinatorial over pipelined hardware. Pipelined hardware trades fast execution time for the ability to start
operations every cycle. In inherently sequential code an iteration i will often have to wait for a result from
iteration i - 1. It is therefore more important that the overall execution time of operations be small
than that operations can be started every cycle.




    char a[BUFSIZE], b[BUFSIZE];
    int c, s1, s2;

    s1 = s2 = 0;
    while (c = a[s1++]) {
        if (c >= 'A' && c <= 'Z') {
            b[s2++] = '\\';
            b[s2++] = c;
        }
        else {
            b[s2++] = c + ('A' - 'a');
        }
    }
    /* b and s2 are live here */

(a)

    Loop:
        c := a(s1)
        s1 := s1+1
        cc1 := (c=0)
        IF cc1 GOTO Exit
        cc2 := (c<'A')
        IF cc2 GOTO Lower_case
        cc3 := (c>'Z')
        IF cc3 GOTO Lower_case
        b(s2) := '\'
        s2 := s2+1
        b(s2) := c
        s2 := s2+1
        GOTO Loop
    Lower_case:
        t := c+('A'-'a')
        b(s2) := t
        s2 := s2+1
        GOTO Loop
    Exit:

(b)

    L1: [{ c:=a(s1); s1:=s1+1 }, ∅, L2]
    L2: [{ cc1:=(c=0); cc2:=(c<'A'); cc3:=(c>'Z'); t:=c+('A'-'a'); c':=c }, ∅, L3]
    L3: [∅, {IF cc1 THEN E1
             ELSE [{b(s2):=t; s2:=s2+1}, {IF cc2 THEN L1
                   ELSE [∅, {IF cc3 THEN L1
                         ELSE [{b(s2):='\'; c'':=c'}, ∅, L4] }, -] }, -] }, -]
    L4: [{b(s2):=c''; s2:=s2+1}, ∅, L1]
    E1:

(c)

Figure 28: (a) A simple C program, (b) Its intermediate code representation, (c) The program compacted
with PS.
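As a reference point for the transformations that follow, the sequential behavior of the intermediate code of figure 28(b) can be transcribed into Python. This is an illustrative sketch: the function name is hypothetical, the string terminator is modeled explicitly, and, as in the text, the input is assumed to consist of upper and lower case letters only.

```python
def copy_for_uppercase_terminal(a):
    """Follow the statements of figure 28(b) in their sequential order."""
    a = a + "\0"                  # C strings end with a zero character
    b = []
    s1 = 0
    while True:
        c = a[s1]                 # c := a(s1)
        s1 += 1                   # s1 := s1+1
        if c == "\0":             # cc1: end of string, leave the loop
            break
        if c < "A" or c > "Z":    # cc2 or cc3: Lower_case branch
            t = chr(ord(c) + (ord("A") - ord("a")))
            b.append(t)
        else:                     # upper case: backslash, then the letter
            b.append("\\")
            b.append(c)
    return "".join(b)


result = copy_for_uppercase_terminal("AbC")
```

The result for "AbC" is the five-character string \AB\C, the output quoted in the text for the pipelined loop.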
Figure 29: Tree representation of instructions L1, L2, L3 and L4.


Memory loads also have a non-interruptible version, which prevents traps when protected or non-existent
memory locations are accessed (in the previous example, for instance, we want to prevent exceptions from
occurring when accessing string a past its end).
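The role of the invalid bit can be illustrated with a short Python model. It is a sketch under assumptions: the class and function names are hypothetical, and the propagation of an invalid input through a non-interruptible operation is assumed, since the text only specifies what happens when the result itself is invalid.

```python
from dataclasses import dataclass


class InvalidOperandTrap(Exception):
    """Raised when an interruptible operation reads an invalid register."""


@dataclass
class Register:
    value: float = 0.0
    invalid: bool = False     # models the extra (33rd) bit of a register


def divide_non_interruptible(c, a):
    # Assumption: an invalid input also makes the result invalid.
    if c.invalid or a.invalid or a.value == 0.0:
        return Register(invalid=True)     # set the invalid bit, raise nothing
    return Register(c.value / a.value)


def add_interruptible(x, y):
    if x.invalid or y.invalid:
        raise InvalidOperandTrap("invalid operand")
    return Register(x.value + y.value)


def guarded_quotient_plus_one(c, a):
    b = divide_non_interruptible(c, a)    # b := c/a hoisted above the test
    if a.value != 0.0:                    # if (a != 0.0) then ...
        return add_interruptible(b, Register(1.0))
    return Register(0.0)
```

With a equal to zero the hoisted division only marks b as invalid; no exception is raised because the guarded branch never reads b with an interruptible operation.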

Data memory is partitioned into N_banks (8) memory banks accessed through a common memory controller
with N_alu/2 ports. The left input of ALU 0 and the output of ALU 1 constitute the address and data for
memory port 0, the left input of ALU 2 and the output of ALU 3 constitute the address and data for memory
port 1, etc. (see figure 31). Memory access takes a single cycle unless there is a bank conflict. When such
a conflict occurs the instruction clock freezes until all the memory requests are satisfied in increasing port
order. This hardware assist has been provided because compile time memory bank disambiguation is very
difficult in the case of non-numerical code.
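The effect of a bank conflict can be sketched as follows. Two assumptions are made here that the text does not state: an address is mapped to bank `address % n_banks`, and each bank serves one request per cycle. Only the increasing port order comes from the description above; the function name `service_order` is hypothetical.

```python
N_BANKS = 8   # prototype value quoted in the text


def service_order(addresses, n_banks=N_BANKS):
    """Group memory ports by the cycle in which their request is served.

    addresses[p] is the address issued on port p, or None if the port is idle.
    """
    cycles = []                               # one {bank: port} dict per cycle
    for port, address in enumerate(addresses):   # increasing port order
        if address is None:
            continue
        bank = address % n_banks              # assumed interleaving
        for cycle in cycles:
            if bank not in cycle:             # bank still free in this cycle
                cycle[bank] = port
                break
        else:
            cycles.append({bank: port})       # clock freezes one more cycle
    return [sorted(cycle.values()) for cycle in cycles]


schedule = service_order([0, 8, 3, None])
```

In this example ports 0 and 1 both address bank 0, so port 1 is served one cycle later while port 2 proceeds in the first cycle.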

To execute an instruction of the abstract machine, the real machine needs to determine the correct
branching address and perform the computations on the selected execution path. This is achieved as follows.

Determining the branching address. In the real machine an instruction is allowed to branch only to N_t
(4) target addresses. Each of the N_t target addresses is associated with a true and a false condition mask
(t-mask and f-mask), each of which specifies a subset of the condition code registers. The branching address is
the target address such that all the condition code registers specified by its t-mask (f-mask) are set to
true (false).

Performing computations on the selected execution path. Determining the branching address takes
practically a whole cycle. As instructions must execute in a single cycle, all the computations specified
in an instruction tree are executed while the branching address is determined. The register stores,
however, are executed conditionally. A transfer enable mask (te-mask) is attached to each operation's
destination register. The te-mask has N_t bits, one for each target address. Assume that the branching
address has been determined to be the ith target address. Before storing the result in a destination
register the machine checks whether the ith bit of its te-mask is 1. If the ith bit is zero the store is
not performed. At the end of the instruction cycle those multi-cycle operations (such as floating point
computations) which are computing useless results are aborted.

Figure 30 panels (node generated, operations shown, successor, queue after generation):
(a) [L1]: { c:=a(s1); s1:=s1+1 }, branching to [L2,L1]; Q=(L2,[L1]).
(b) [L2,L1]: { cc1:=(c=0); cc2:=(c<'A'); cc3:=(c>'Z'); t:=c+('A'-'a'); c':=c } followed by
    { c:=a(s1); s1:=s1+1 }, branching to [L3,L2,L1]; Q=(L3,[L2,L1]).
(c) [L3,L2,L1]; Q=(L4,[L3,L2,L1]).
(d) [L4,L3,L2,L1]: { b(s2):=c''; s2:=s2+1 }, branching to [L3,L2,L1]; Q=empty.

Figure 30: (a) [L1] is generated, (b) [L2,L1] is generated, (c) [L3,L2,L1] is generated, (d) [L4,L3,L2,L1] is
generated. [L3,L2,L1] cannot execute concurrently with L4, because of data dependence conflicts between
s2 and b(s2).

Figure 31 components: Multiported Register File; ALU 0 through ALU n-1; address and data ports 1 through n/2;
Memory Controller; Memory Bank 0 through Memory Bank m.

Figure 31: Block diagram of the machine data path. In the picture n stands for N_alu, and m for N_banks.
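The two steps described above for the real machine (target selection with t-masks and f-masks, then conditional stores with te-masks) can be sketched in Python. The encoding is an assumption made for illustration: masks are sets of condition code register names, a te-mask is an integer whose bit i enables the store for target i, and the sample targets and operations are loosely modeled on instruction L3 of figure 28(c).

```python
def select_target(targets, cc):
    """Return (index, address) of the target whose masks match cc."""
    for index, (address, t_mask, f_mask) in enumerate(targets):
        if all(cc[r] for r in t_mask) and not any(cc[r] for r in f_mask):
            return index, address
    raise ValueError("no target address matches the condition codes")


def execute(registers, operations, targets, cc):
    """Compute every operation, then store only the enabled results."""
    # All computations read the old register values.
    results = [(dest, fn(registers), te) for dest, fn, te in operations]
    index, address = select_target(targets, cc)
    new_registers = dict(registers)
    for dest, value, te_mask in results:
        if (te_mask >> index) & 1:        # ith bit of the te-mask is 1
            new_registers[dest] = value
    return new_registers, address


# Four targets (the prototype's limit): (address, t-mask, f-mask).
targets = [
    ("E1", {"cc1"}, set()),
    ("L1", {"cc2"}, {"cc1"}),
    ("L1", {"cc3"}, {"cc1", "cc2"}),
    ("L4", set(), {"cc1", "cc2", "cc3"}),
]
operations = [
    ("s2", lambda r: r["s2"] + 1, 0b1110),   # stored unless the loop exits
    ("c2", lambda r: r["c1"], 0b1000),       # stored only on the path to L4
]
registers = {"s2": 0, "c1": "A", "c2": "?"}
upper = execute(registers, operations, targets,
                {"cc1": False, "cc2": False, "cc3": False})
done = execute(registers, operations, targets,
               {"cc1": True, "cc2": False, "cc3": False})
```

Both operations are always computed; the te-masks decide which results reach the register file once the target index is known.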

5.2.2      Perfect  Pipelining 

Aiken  and  Nicolau  [Aiken88c]  have  presented  a  technique,  perfect  pipelining,  which  achieves  the  same  results 
as  unbounded  loop  unrolling  followed  by  global  compaction.  The  idea  behind  perfect  pipelining  is  to  move 
entire  iterations  up  without  creating  execution  gaps.  This  generates  rigid  pipelines  as  opposed  to  the  flexible 
pipelines  created  by  pipeline  scheduling.  However  while  simple  pipeline  scheduling  imposes  a  minimum 
of  1  cycle  delay  between  the  execution  of  successive  iterations,  perfect  pipelining  can  start  consecutive 
iterations  simultaneously  if  resource  and  data  dependence  constraints  permit  (loop  unrolling  followed  by 
pipeline scheduling can achieve a similar result). Perfect pipelining proceeds as follows. Let L be a single
entry loop.

1. Unroll L, say k times (k will be determined later). Call the unwound loop L^k.

2. Move iteration i+1 as far up in the parallel program graph of L^k on as many paths as possible by
using the core transformations of PS. Iteration i+1 is shifted in a rigid fashion, that is no execution
gaps are allowed in iteration i+1 (i.e. between the first and last operations of iteration i+1 there
must be an operation of iteration i+1 scheduled on every cycle). As iteration i+1 moves through the
parallel program graph of L^k, copies of operations are generated where paths split, effectively creating
other copies of iteration i+1. These copies are shifted up as well.

3. When the shift up process stops because none of the copies of iteration i+1 can move any longer (due
to resource or data dependence constraints), the move-up process is repeated for iterations i+2,
i+3, ..., i+k. Call the final result L^k_move-up.

4. Consider each execution path P starting at the root of the parallel program graph of L^k_move-up. Let
v_P be a vertex of P containing at least one operation from iteration i+1. Then for a large enough k
there must exist a node v'_P of P which has the same structure as v_P. Replace all the edges (u, v'_P) in
the parallel program graph of L^k_move-up with edges (u, v_P) and increment the loop index along those
edges by the appropriate amount.

5. Repeat step 4 until all execution paths and all vertices containing at least one operation from iteration
i+1 have been explored, then delete all unreachable nodes of the parallel program graph of L^k_move-up.
This is the final result of perfect pipelining.

6. The value of k can be determined on the fly. Initially set k = 2. If after applying steps 1 through 5
there still exists some vertex v_P (as defined in step 4) on some path P for which no node v'_P (of P) having
the same structure can be found, then we increment k, add iteration i+k to the end of L^k_move-up,
and repeat the whole process until convergence (Aiken and Nicolau have proved that the process
eventually converges).
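The search for a node v'_P with the same structure as v_P (step 4) can be sketched in Python under one plausible reading of "same structure": two nodes match when they hold the same operations once iteration numbers are shifted by a constant. This reading, the node encoding as sets of (operation, iteration) pairs, and the function names are assumptions; the sketch also ignores successor edges and the restriction to vertices containing an operation of iteration i+1.

```python
def normalize(node):
    """Return the node's shape with iterations counted from its earliest one."""
    base = min(iteration for _, iteration in node)
    shape = frozenset((op, iteration - base) for op, iteration in node)
    return shape, base


def find_repeat(path):
    """Find the first two nodes on a path with the same shape.

    Returns (earlier position, later position, loop index increment),
    or None when the path is too short, in which case k must grow (step 6).
    """
    seen = {}
    for position, node in enumerate(path):
        if not node:
            continue                      # empty nodes carry no structure
        shape, base = normalize(node)
        if shape in seen:
            first_position, first_base = seen[shape]
            return first_position, position, base - first_base
        seen[shape] = (position, base)
    return None


# Hypothetical path on which every test fails, two tests per instruction.
path = [{("T", 0), ("T", 1)}, {("T", 2), ("T", 3)}]
match = find_repeat(path)
```

Here the second node repeats the first with the iteration numbers shifted by two, so the edge into it can be redirected to the first node with the loop index incremented by 2.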

The  algorithm  has  an  exponential  worst  case  running  time  which,  according  to  Aiken  and  Nicolau,  never 
arises  in  practice.  As  we  previously  said  perfect  pipelining  produces  a  rigid  schedule  i.e.,  once  the  execution 
of  some  iteration  starts  it  will  run  to  completion  without  pausing. 

We  illustrate  perfect  pipelining  on  the  following  loop: 

    FOR i IN 1..n LOOP
        IF t_key(i)=key THEN
            s:=s+t_val(i);
        END IF;
    END LOOP;

The parallel program graph of the above loop is given in figure 32(a). T denotes the test IF t_key(i)=key and S
the statement s:=s+t_val(i). Loop control code has been omitted for simplicity; i is assumed to be incremented
on the loop back edges. For simplicity we assume that the test IF t_key(i)=key does the comparison and
the branching at the same time (this is not allowed by the PS model but yields a final parallel
program graph of manageable size). Figure 32(b) shows the loop unrolled 4 times. We assume that our
target machine can only perform two conditional tests per instruction. After pushing up iterations i+1 and
i+2 as much as possible we obtain the parallel program graphs of figures 32(c) and 32(d). Tests within a
node are arranged as shown next to the root node T0T1. Once all the iterations have been pushed up we
obtain the parallel program graph of figure 32(e). After applying step 4 to the nodes indicated by the dotted arcs
we obtain the parallel program graph of figure 32(f). The final program graph produced by perfect pipelining is
obtained by applying step 4 to the nodes of figure 32(f) having similar structure. The final result is given in figure 32(g).
Figure 32: (a) Initial parallel program graph, (b) The loop unrolled 4 times, (c) Iteration i+1 has been
pushed up, (d) Iteration i+2 has been pushed up, (e) All 4 iterations have been pushed up, (f) Step 4 has
been applied to the nodes pointed to by the dotted arcs, (g) Final parallel program graph.
6      Conclusion

VLIWs are able to achieve high performance at low cost thanks to the hardware/compiler symbiosis, which
allows relatively inexpensive hardware to be employed.

Two types of compilation techniques have been surveyed: global compaction and software pipelining.
Both techniques are needed for the generation of efficient code. As the ratio of value-returning operations to
conditional jumps is more or less fixed in a given algorithm (10% to 30%), VLIWs have to incorporate some
kind of multi-branching capability; three such multi-branching mechanisms have been surveyed.

Given the interesting results, a natural question to ask is whether the VLIW concept scales up. Cycles in
program data dependence graphs are serious performance bottlenecks, as the slope δ(C)/λ(C) of a cycle C
imposes an upper bound on program speedup. This suggests that there is a limit to the scalability of VLIWs.

Programs contain three kinds of parallelism: fine grain, coarse grain and task level. When all fine grain
parallelism has been uncovered, higher levels of parallelism have to be extracted if greater speedups are
desired. VLIWs are ill suited to exploit coarse grain and task level parallelism. Parallel processors, however, have
been designed to exploit such types of parallelism, but have too high an overhead to use fine grain parallelism
effectively. A hybrid scheme combining multiprocessor and VLIW architectures could effectively use all
types of parallelism. A compiler would automatically extract fine grain as well as coarse grain parallelism,
while the user would be responsible for task level parallelism. The program dependence graph would be a
valuable tool throughout the compilation process, as it allows simultaneous extraction of fine grain as well as
coarse grain parallelism.